Abstract

We consider the problem of storing and retrieving information from synthetic DNA media. The mathematical 
basis of the problem is the construction and design of sequences that may be discriminated based on their collection 
of substrings observed through a noisy channel. This problem of reconstructing sequences from traces was first 
investigated in the noiseless setting under the name of “Markov type” analysis. Here, we explain the connection 
between the reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the notion 
of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and 
propose new asymmetric coding techniques to combat the effects of synthesis and sequencing noise. In our analysis, 
we make use of restricted de Bruijn graphs and Ehrhart theory for rational polytopes. 


1. Introduction 

Reconstructing sequences based on partial information about their subsequences, substrings, or composition 
is an important problem arising in channel synchronization systems, phylogenomics, genomics, and proteomic 
sequencing [1]-[3]. With the recent development of archival DNA-based storage devices [4], [5] and rewritable,
random-access storage media [6], a new family of reconstruction questions has emerged regarding how to design
sequences which can be easily and accurately reconstructed based on their substrings, in the presence of read and
write errors. The write process reduces to DNA synthesis, while the read process involves both DNA sequencing and
assembly. The assembly procedure is NP-hard under most formulations [7]. Nevertheless, practical approximation
algorithms based on Eulerian paths in de Bruijn graphs have been shown to offer good reconstruction performance under
the high-coverage model [8].

In the setting we propose to analyze, one first synthesizes a sequence $x \in \mathcal{D}^n = \{A, T, G, C\}^n$, and then fragments
it in the process of sequencing into a collection of substrings of approximately the same length, $\ell$. These substrings
are often referred to as reads. In practice, the length $\ell$ ranges from 100 to 1500 nts¹. Ideally, one
would like to synthesize $x$ and sequence all $\ell$-substrings without errors, which is not possible in practice. For large
$n$, the synthesis error rate of $x$ is roughly 1 to 3%. Substrings of short length may be sequenced with an error rate
not exceeding 1%; long substrings exhibit much higher sequencing error rates, often as high as 15%. Furthermore,
due to non-uniform fragmentation, a number of the substrings are not available for sequencing, leaving coverage
gaps in the original message.

To model this read-write phenomenon, we introduce the notion of a DNA storage channel that takes as its input
a sequence $x$ of length $n$ and introduces $s_{\rm syn}$ substitution errors in $x$, with the resulting sequence denoted by $\tilde{x}$. The
channel proceeds to output all or a subset of the substrings of the sequence $\tilde{x}$ of length $\ell$, $\ell < n$. Each of the substrings
is allowed to have additional substitution errors, due to sequencing. The total number of substring sequencing errors

This work was supported in part by the NSF STC Class 2010 CCF 0939370 grant and the Strategic Research Initiative (SRI) Grant 
conferred by the University of Illinois, Urbana-Champaign. Research of the second author was supported by the IC Postdoctoral Research 
Fellowship. This work has been submitted in part to ISIT 2015 and will appear in part at ITW 2015. 

¹For our system currently under development, due to the high cost of synthesis, we have chosen $n = 1000$. In addition, we used multiple
sequences to increase storage capacity.
Fig. 1. The DNA Storage Channel. Information is encoded in a DNA sequence $x$ which is synthesized with potential errors. The output
of the synthesis process is $\tilde{x}$. During readout, the sequence $\tilde{x}$ is read through the sequencing channel, which fragments the sequence and
possibly perturbs the fragments via substitution errors. The output of the channel is a set of DNA fragments, along with their frequency counts.


equals $s_{\rm seq}$. The substrings at the output of the DNA storage channel are collectively enumerated by a vector $\hat{x}$,
termed the channel output (see Fig. 1 for an illustration).

The main contributions of the paper are as follows. The first contribution is to model the read process (sequencing) 
through the use of profile vectors. A profile vector of a sequence enumerates all substrings of the sequence, and 
profiles form a pseudometric space amenable to coding-theoretic analysis. The second contribution of the paper
is to introduce a new family of codes for the three classes of errors arising in the DNA storage channel due to
synthesis, lack of coverage and sequencing, and show that they may be characterized by asymmetric errors studied
in classical coding theory. Our third contribution is a code design technique which makes use of (a) codewords
with different profile vectors or profile vectors at sufficiently large distance from each other; and (b) codewords
with $\ell$-substrings of high biochemical stability which are also resilient to errors. For this purpose, we consider a
number of codeword constraints known to influence the performance of both the synthesis and sequencing systems,
one of which we term the balanced content constraint.

For the case when we allow arbitrary $\ell$-substrings, the problem of enumerating all valid profile vectors was
previously addressed by Jacquet et al. [9] in the context of “Markov types”. However, the method of Jacquet et al.
does not extend to the case of enumeration of profiles with specific $\ell$-substring constraints or profiles at sufficiently
large distance from each other. We cast our more general enumeration and code design question as a problem of
enumerating integer points in a rational polytope and use tools from Ehrhart theory to provide estimates of the
sizes of the underlying codes. We also describe two decoding procedures for sequence profiles that combine
graph-theoretic principles and sequencing-by-hybridization methods.

2. Profile Vectors and a Metric Space 

Let $[\![q]\!]$ denote the set of integers $\{0, 1, 2, \ldots, q - 1\}$ and consider a word $x$ of length $n$ over $[\![q]\!]$. Suppose that
$\ell < n$. An $\ell$-gram is a substring of $x$ of length $\ell$. Let $p(x; q, \ell)$ denote the ($\ell$-gram) profile vector of length $q^\ell$,
indexed by all words of $[\![q]\!]^\ell$ ordered lexicographically. In the profile vector, an entry indexed by $z$ gives the number
of occurrences of $z$ as an $\ell$-gram of $x$. For example, $p(0000; 2, 2) = (3, 0, 0, 0)$, while $p(0101; 2, 2) = (0, 2, 1, 0)$.
Observe that for any $x \in [\![q]\!]^n$, the sum of entries in $p(x; q, \ell)$ equals $n - \ell + 1$.
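As a concrete illustration of the definition, the following Python sketch computes an $\ell$-gram profile vector and reproduces the two examples above. The function name `profile` and the encoding of words as lists of integers are assumptions made for this illustration, not notation from the paper.

```python
from itertools import product

def profile(x, q, ell):
    """Return the ell-gram profile vector of the word x over {0, ..., q-1}.

    Entries are indexed by all q**ell words in lexicographic order.
    """
    # itertools.product enumerates tuples in lexicographic order.
    index = {gram: i for i, gram in enumerate(product(range(q), repeat=ell))}
    counts = [0] * (q ** ell)
    for start in range(len(x) - ell + 1):
        counts[index[tuple(x[start:start + ell])]] += 1
    return tuple(counts)

print(profile([0, 0, 0, 0], 2, 2))  # (3, 0, 0, 0)
print(profile([0, 1, 0, 1], 2, 2))  # (0, 2, 1, 0)
```

The dense representation has length $q^\ell$ and is only practical for small parameters; a sparse dictionary of counts would be the natural choice for realistic read lengths.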

Suppose that the data of interest is encoded by a vector $x \in [\![q]\!]^n$ and let $\hat{x}$ be the output profile of the DNA
channel. In what follows, we characterize the error vector $e = p(x; q, \ell) - \hat{x}$ which arises in the process of
DNA-based data storage and the type of errors captured by this vector.

(i) Substitution errors due to synthesis. Here, certain symbols in the word $x$ may be changed as a result of
erroneous synthesis. If one symbol is changed, in the perfect coverage case, $\ell$ $\ell$-grams will decrease their
counts by one and $\ell$ $\ell$-grams will increase their counts by one. Hence, the error resulting from $s_{\rm syn}$ substitutions
equals $e = e_- - e_+$, where $e_+, e_- \ge 0$, and both vectors have weight $s_{\rm syn}\ell$.

(ii) Substitution errors due to sequencing. Here, certain symbols in each fragment $x_i$ may be changed during
the sequencing process. Suppose the $\ell$-gram $x_i$ is altered to $\tilde{x}_i$, $\tilde{x}_i \ne x_i$. Then the count for $x_i$ will decrease
by one while the count for $\tilde{x}_i$ will increase by one. Hence, the error resulting from $s_{\rm seq}$ substitutions equals
$e = e_- - e_+$, where $e_+, e_- \ge 0$, and both vectors have weight $s_{\rm seq}$.

(iii) Undersampling errors. Such errors occur when not all $\ell$-grams are observed during fragmentation and
subsequently sequenced. For example, suppose that $x = 0000000$, which contains five copies of the 3-gram 000, and
that $\hat{x}$ is the channel output 3-gram profile vector. Undersampling of one 3-gram results in the entry $\hat{x}_{000}$ being
four instead of five. Note that undersampling of $t$ $\ell$-grams results in an asymmetric error $e \ge 0$ of weight $t$.
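The effect of the error types on the profile can be checked on a toy word. The sketch below is illustrative only: the helper names (`gram_counts`, `error_vector`, `weights`) and the chosen word are assumptions of this example. It computes $e = p(x) - \hat{x}$ for a single synthesis substitution and for a single undersampled 3-gram.

```python
from collections import Counter

def gram_counts(x, ell):
    """Sparse ell-gram profile of x: maps each ell-gram (as a tuple) to its count."""
    return Counter(tuple(x[i:i + ell]) for i in range(len(x) - ell + 1))

def error_vector(sent, received):
    """Nonzero entries of e = p(x) - x_hat for sparse profiles sent and received."""
    return {g: sent[g] - received[g]
            for g in set(sent) | set(received) if sent[g] != received[g]}

def weights(e):
    """Return (weight of e_-, weight of e_+) in the decomposition e = e_- - e_+."""
    return (sum(v for v in e.values() if v > 0),
            sum(-v for v in e.values() if v < 0))

ell = 3
x = [0] * 7                      # error-free word, five copies of 000
x_syn = [0, 0, 0, 1, 0, 0, 0]    # one synthesis substitution in the middle

e_syn = error_vector(gram_counts(x, ell), gram_counts(x_syn, ell))
print(weights(e_syn))            # (3, 3): both parts have weight ell

x_hat = gram_counts(x, ell)
x_hat[(0, 0, 0)] -= 1            # one 3-gram is never observed
e_under = error_vector(gram_counts(x, ell), x_hat)
print(weights(e_under))          # (1, 0): purely asymmetric error
```

A substitution close to either end of the word touches fewer than $\ell$ of the $\ell$-grams, so the weight $s_{\rm syn}\ell$ should be read as the worst case.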

Consider further a subset $S \subseteq [\![q]\!]^\ell$. For $x \in [\![q]\!]^n$, we similarly define $p(x; S)$ to be the vector indexed by $S$,
whose entry indexed by $z$ gives the number of occurrences of $z$ as an $\ell$-gram of $x$. We are interested in vectors $x$
whose $\ell$-grams belong to $S$. Once again, the sum of entries in $p(x; S)$ equals $n - \ell + 1$.

The choice of S is governed by certain considerations in DNA sequence design, including 

(i) Weight profiles of $\ell$-grams. For the application at hand, one may want to choose $S$ to consist of $\ell$-grams
with a fixed proportion of C and G bases, as this proportion - known as the GC-content of the sequence -
influences the thermostability, folding processes and overall coverage of the $\ell$-grams. From the perspective of
sequencing, GC contents of roughly 50% are desired.

To make this modeling assumption more precise and general, we assume sets $S$ of the form described below.
Suppose that $0 \le w_1 < w_2 \le \ell$ and $1 \le q^* \le q - 1$. Let $[w_1, w_2]$ denote the set of integers $\{w_1, w_1 + 1, \ldots, w_2\}$.
For each $x \in [\![q]\!]^\ell$, let the $q^*$-weight of $x$ be the number of symbols in $x$ that belong to $[q - q^*, q - 1]$, and
denote the weight by $\mathrm{wt}(x; q^*)$. Let

$$S(q, \ell; q^*, [w_1, w_2]) = \{x \in [\![q]\!]^\ell : \mathrm{wt}(x; q^*) \in [w_1, w_2]\}$$

be the set of all sequences with $\ell$-gram weights restricted to $[w_1, w_2]$. For example, representing A, T, G, C
by 0, 1, 2, 3, respectively, and setting $q = 4$ and $q^* = 2$, the choice $w_1 = \lfloor \ell/2 \rfloor$, $w_2 = w_1 + 1$ enforces the
balanced GC constraint. Also, note that $S(q, \ell; q^*, [0, \ell]) = [\![q]\!]^\ell$, for any choice of $q^*$.
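The weight-constrained set can be enumerated directly for small parameters. The following sketch is a brute-force illustration; the function names `q_star_weight` and `constrained_grams` are assumptions of this example.

```python
from itertools import product

def q_star_weight(word, q, q_star):
    """Number of symbols of word lying in {q - q_star, ..., q - 1}."""
    return sum(1 for s in word if s >= q - q_star)

def constrained_grams(q, ell, q_star, w1, w2):
    """All ell-grams over {0, ..., q-1} whose q_star-weight lies in [w1, w2]."""
    return [w for w in product(range(q), repeat=ell)
            if w1 <= q_star_weight(w, q, q_star) <= w2]

# DNA alphabet A, T, G, C -> 0, 1, 2, 3; q_star = 2 counts the symbols G and C.
ell = 4
S = constrained_grams(4, ell, 2, ell // 2, ell // 2 + 1)
print(len(S))  # 160 of the 256 possible 4-grams
```

With the weight range $[0, \ell]$ the same call returns every $\ell$-gram, matching the remark that $S(q, \ell; q^*, [0, \ell]) = [\![q]\!]^\ell$.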

(ii) Forbidden $\ell$-grams. Studies have indicated that certain substrings in DNA sequences - such as GCG, CGC -
are likely to cause sequencing errors (see [10]). Hence, one may also choose $S$ so as to avoid certain $\ell$-grams.
Treatment of specialized sets of forbidden $\ell$-grams is beyond the scope of this paper and is deferred to future
work.

Therefore, with an appropriate choice of S, we may lower the probability of substitution errors due to synthesis, 
lack of coverage and sequencing. Furthermore, as we show in our subsequent derivations, a carefully chosen set S 
may improve the error-correcting capability by designing codewords to be at a sufficiently large “distance” from 
each other. Next, we formally define the notion of sequence and profile distance as well as error-correcting codes 
for the corresponding DNA channel. 

A. Error-Correcting Codes for the DNA Storage Channel 

Fix $S \subseteq [\![q]\!]^\ell$. Let $N$ be an integer which usually denotes the number of $\ell$-grams in the profile vector, i.e. $N = |S|$.
Define the $L_1$-weight of a word $u \in \mathbb{Z}^N_{\ge 0}$ as $\mathrm{wt}(u) = \sum_{i=1}^{N} u_i$. In addition, for any pair of words $u, v \in \mathbb{Z}^N_{\ge 0}$, let
$\Delta(u, v) \triangleq \sum_{i=1}^{N} \max(u_i - v_i, 0)$ and define the asymmetric distance as $d_{\rm asym}(u, v) = \max(\Delta(u, v), \Delta(v, u))$. A
set $C$ is called an $(N, d)$-asymmetric error correcting code (AECC) if $C \subseteq \mathbb{Z}^N_{\ge 0}$ and $d = \min\{d_{\rm asym}(x, y) : x, y \in
C, x \ne y\}$. For any $x \in C$, let $e \in \mathbb{Z}^N_{\ge 0}$ be such that $x - e \ge 0$. We say that an asymmetric error $e$ occurred if the
received word is $x - e$. We have the following theorem characterizing asymmetric error-correction codes (see [11,
Thm 9.1]).

Theorem 2.1. An (N, d + 1)-AECC corrects any asymmetric error of weight at most d. 

Next, we let $([\![q]\!]^n; S)$ denote all $q$-ary words of length $n$ whose $\ell$-grams belong to $S$ and define the $\ell$-gram
distance between two words $x, y \in ([\![q]\!]^n; S)$ as

$$d_{\rm gram}(x, y; S) = d_{\rm asym}(p(x; S), p(y; S)).$$

Note that $d_{\rm gram}$ is not a metric, as $d_{\rm gram}(x, y; S) = 0$ does not imply that $x = y$. For example, we have
$d_{\rm gram}(0010, 1001; [\![2]\!]^2) = 0$. Nevertheless, $(([\![q]\!]^n; S), d_{\rm gram})$ forms a pseudometric space. We convert this space
into a metric space via an equivalence relation called metric identification. Specifically, we say that $x \overset{d_{\rm gram}}{\sim} y$ if
and only if $d_{\rm gram}(x, y; S) = 0$. Then, by defining $\mathcal{Q}(n; S) = ([\![q]\!]^n; S) / \overset{d_{\rm gram}}{\sim}$, we can make $(\mathcal{Q}(n; S), d_{\rm gram})$ into a
metric space. An element $X$ in $\mathcal{Q}(n; S)$ is an equivalence class, where $x, x' \in X$ implies that $p(x; S) = p(x'; S)$.
We specify the choice of representative for $X$ in Section 7 and henceforth refer to elements in $\mathcal{Q}(n; S)$ by their
representative words. Let $p\mathcal{Q}(n; S)$ denote the set of profile vectors of words in $\mathcal{Q}(n; S)$. Hence, $|p\mathcal{Q}(n; S)| = |\mathcal{Q}(n; S)|$.
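The distance and the resulting equivalence classes can be made concrete for small parameters. The sketch below takes $S = [\![q]\!]^\ell$; the function names are assumptions of this illustration. It evaluates the $\ell$-gram distance and groups all binary words of length 4 by their 2-gram profile.

```python
from itertools import product

def profile(x, q, ell):
    """ell-gram profile of x, indexed by all q-ary ell-grams in lexicographic order."""
    index = {g: i for i, g in enumerate(product(range(q), repeat=ell))}
    counts = [0] * (q ** ell)
    for i in range(len(x) - ell + 1):
        counts[index[tuple(x[i:i + ell])]] += 1
    return tuple(counts)

def delta(u, v):
    """Sum of the positive parts of u - v."""
    return sum(max(a - b, 0) for a, b in zip(u, v))

def d_asym(u, v):
    return max(delta(u, v), delta(v, u))

def d_gram(x, y, q, ell):
    return d_asym(profile(x, q, ell), profile(y, q, ell))

print(d_gram([0, 0, 1, 0], [1, 0, 0, 1], 2, 2))  # 0: distinct words, same profile
print(d_gram([0, 0, 0, 0], [0, 1, 0, 1], 2, 2))  # 3

# Metric identification: one equivalence class per distinct profile vector.
classes = {}
for w in product(range(2), repeat=4):
    classes.setdefault(profile(w, 2, 2), []).append(w)
print(len(classes))  # 12 classes among the 16 binary words of length 4
```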

Let $C \subseteq \mathcal{Q}(n; S)$. If $d = \min\{d_{\rm gram}(x, y; S) : x, y \in C, x \ne y\}$, then $C$ is called an $(n, d; S)$-$\ell$-gram
reconstruction code (GRC). The following proposition demonstrates that an $\ell$-gram reconstruction code is able
to correct synthesis and sequencing errors provided that its $\ell$-gram distance is sufficiently large. We observe that
synthesis errors have effects that are $\ell$ times stronger since the error in some sense propagates through multiple
$\ell$-grams.

Proposition 2.2. An $(n, d; S)$-GRC can correct $s_{\rm syn}$ substitution errors due to synthesis, $s_{\rm seq}$ substitution errors
due to sequencing and $t$ undersampling errors provided that $d > 2 s_{\rm syn} \ell + 2 s_{\rm seq} + t$.

Proof: Consider an $(n, d; S)$-GRC $C$ and the set $p(C) = \{p(x; S) : x \in C\}$. By construction, $p(C)$ is an
$(N, d)$-AECC with $N = |S|$ that corrects all asymmetric errors of weight $\le 2 s_{\rm syn} \ell + 2 s_{\rm seq} + t$.

Suppose that, on the contrary, $C$ cannot correct $s_{\rm syn}$ substitution errors due to synthesis, $s_{\rm seq}$ substitution errors
due to sequencing and $t$ undersampling errors. Then, there exist two distinct codewords $x, x' \in C$ and error vectors
$e_{s_1,+}, e_{s_1,-}, e_{s_2,+}, e_{s_2,-}, e_t, e'_{s_1,+}, e'_{s_1,-}, e'_{s_2,+}, e'_{s_2,-}, e'_t$, such that $\hat{x} = \hat{x}'$, that is, such that

$$x + e_{s_1,+} - e_{s_1,-} + e_{s_2,+} - e_{s_2,-} - e_t = x' + e'_{s_1,+} - e'_{s_1,-} + e'_{s_2,+} - e'_{s_2,-} - e'_t.$$

Here, $e_{s_1,-} - e_{s_1,+}$ and $e'_{s_1,-} - e'_{s_1,+}$ are the error vectors due to substitutions during synthesis in $x$ and $x'$,
respectively; each of the vectors $e_{s_1,-}, e_{s_1,+}, e'_{s_1,-}, e'_{s_1,+}$ has weight $s_{\rm syn} \ell$; the vectors $e_{s_2,-} - e_{s_2,+}$ and $e'_{s_2,-} - e'_{s_2,+}$
model substitution errors during sequencing in $x$ and $x'$, respectively; each of the vectors $e_{s_2,-}, e_{s_2,+}, e'_{s_2,-}, e'_{s_2,+}$
has weight $s_{\rm seq}$; and $e_t$ and $e'_t$ are the undersampling error vectors of $x$ and $x'$, respectively, and both $e_t, e'_t$ have
weight $t$. Therefore,

$$x - (e_{s_1,-} + e_{s_2,-} + e_t + e'_{s_1,+} + e'_{s_2,+}) = x' - (e'_{s_1,-} + e'_{s_2,-} + e'_t + e_{s_1,+} + e_{s_2,+}),$$

where $e_{s_1,-} + e_{s_2,-} + e_t + e'_{s_1,+} + e'_{s_2,+}$ and $e'_{s_1,-} + e'_{s_2,-} + e'_t + e_{s_1,+} + e_{s_2,+}$ are nonnegative vectors of weight
at most $2 s_{\rm syn} \ell + 2 s_{\rm seq} + t$. This contradicts the fact that $x$ and $x'$ belong to a code that corrects asymmetric errors
with weight at most $2 s_{\rm syn} \ell + 2 s_{\rm seq} + t$. ■
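The condition in Proposition 2.2 translates into a simple design rule for the minimum $\ell$-gram distance. The helper below is a sketch; the function name and the numerical error budgets are hypothetical and chosen only for illustration.

```python
def min_gram_distance(s_syn, s_seq, t, ell):
    """Smallest integer d satisfying d > 2*s_syn*ell + 2*s_seq + t."""
    return 2 * s_syn * ell + 2 * s_seq + t + 1

# Hypothetical budget: one synthesis error, two sequencing errors,
# three missing reads, read length ell = 100.
print(min_gram_distance(1, 2, 3, 100))  # 208
```

The factor $\ell$ on the synthesis term dominates for realistic read lengths, which reflects the observation that a single synthesis error propagates through up to $\ell$ different $\ell$-grams.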

Throughout the remainder of the paper, we consider the problem of enumerating the profile vectors in $p\mathcal{Q}(n; S)$
and constructing $(n, d; S)$-$\ell$-gram reconstruction codes for a general subset $S \subseteq [\![q]\!]^\ell$. Our solutions are characterized
by properties associated with a class of graphs defined on $S$, which we introduce in Section 3. In the same section,
we collect enumeration results for $\mathcal{Q}(n; S)$. Section 4 is devoted to the proof of the main enumeration result using
Ehrhart theory. We further exploit Ehrhart theory and certain graph theoretic concepts to construct codes in Section
5 and summarize numerical results for the special case where $S = S(q, \ell; q^*, [w_1, w_2])$ in Section 6. Finally, we
describe practical decoding procedures in Section 7.

Remark 1. 

(i) For the case $S = [\![q]\!]^\ell$, given a word $x$, Ukkonen made certain observations on the structure of certain words
in the equivalence class of $x$, but was unable to completely characterize all words within the class [12]. Here,
we focus on computing the number of equivalence classes for a general subset $S$.

(ii) For ease of exposition, we abuse notation by identifying words in $\mathcal{Q}(n; S)$ with their corresponding profile
vectors in $p\mathcal{Q}(n; S)$, and refer to GRCs as being subsets of $\mathcal{Q}(n; S)$ or $p\mathcal{Q}(n; S)$ interchangeably.

3. Restricted De Bruijn Graphs and Enumeration of Profile Vectors 

We use standard concepts and terminology from graph theory, following Bollobás [13].

A directed graph (digraph) $D$ is a pair of sets $(V, E)$, where $V$ is the set of nodes and $E$ is a set of ordered
pairs of $V$, called arcs. If $e = (v, v')$ is an arc, we call $v$ the initial node and $v'$ the terminal node. We allow
loops in our digraphs: in other words, we allow $v = v'$.

The incidence matrix of a digraph $D$ is a matrix $B(D)$ in $\{-1, 0, 1\}^{V \times E}$, where

$$B(D)_{v,e} = \begin{cases} 1 & \text{if } e \text{ is not a loop and } v \text{ is its terminal node,} \\ -1 & \text{if } e \text{ is not a loop and } v \text{ is its initial node,} \\ 0 & \text{otherwise.} \end{cases}$$

Observe that when a digraph $D$ has loops, its incidence matrix $B(D)$ has 0-columns indexed by these loops. When
$D$ is connected, it is known that the rank of $B(D)$ equals $|V| - 1$ (see [13, §II, Thm 9 and Ex. 38]).

A walk of length $n$ in a digraph is a sequence of nodes $v_0 v_1 \cdots v_n$ such that $(v_i, v_{i+1}) \in E$ for all $i \in [\![n]\!]$. A walk
is closed if $v_0 = v_n$ and a cycle is a closed walk with distinct nodes, i.e., $v_i \ne v_j$ for $0 \le i < j < n$. We consider a
loop to be a cycle of length one. Given a subset $C$ of the arc set, let $\chi(C) \in \{0, 1\}^E$ be its incidence vector, where
$\chi(C)_e$ is one if $e \in C$ and zero otherwise. In general, for any closed walk $C$ in $D$, we have $B(D)\chi(C) = 0$. A
digraph is strongly connected if for all $z, z' \in V$, there exists a walk from $z$ to $z'$ and vice versa.

We are concerned with a generalization of a family of digraphs, namely, the de Bruijn graphs [14]. Given $q$ and $\ell$,
the standard de Bruijn graph is defined on the node set $[\![q]\!]^{\ell-1}$. Here, we fix a subset $S \subseteq [\![q]\!]^\ell$ and define the corresponding
restricted de Bruijn graph, denoted by $D(S)$. The nodes of $D(S)$ are the $(\ell - 1)$-grams appearing in the words in
$S$. The pair $(v, v')$ belongs to the arc set if and only if $v_i = v'_{i-1}$ for $2 \le i \le \ell - 1$ and $v_1 v_2 \cdots v_{\ell-1} v'_{\ell-1} \in S$. Hence,
we identify the arc set with $S$ and denote the node set as $V(S)$.
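The construction of $D(S)$ is mechanical: each $\ell$-gram in $S$ becomes an arc from its length-$(\ell-1)$ prefix to its length-$(\ell-1)$ suffix. The sketch below encodes words as strings; the function names are assumptions of this example. It builds the graph and checks that in-degree equals out-degree at every node, a necessary condition for being Eulerian.

```python
from itertools import product

def restricted_de_bruijn(S):
    """Return (nodes, arcs) of D(S); arcs maps each ell-gram to (initial, terminal)."""
    arcs = {z: (z[:-1], z[1:]) for z in S}
    nodes = {v for pair in arcs.values() for v in pair}
    return nodes, arcs

def is_balanced(nodes, arcs):
    """True if every node has equal in-degree and out-degree."""
    balance = {v: 0 for v in nodes}
    for initial, terminal in arcs.values():
        balance[initial] -= 1
        balance[terminal] += 1
    return all(b == 0 for b in balance.values())

# S(2, 4; 1, [2, 3]): binary 4-grams containing two or three ones.
S = ["".join(w) for w in product("01", repeat=4) if 2 <= w.count("1") <= 3]
nodes, arcs = restricted_de_bruijn(S)
print(len(nodes), len(arcs))     # 7 nodes, 10 arcs
print(arcs["0110"])              # ('011', '110')
print(is_balanced(nodes, arcs))  # True
```

Balance alone does not imply that the graph is Eulerian; connectivity has to be verified separately.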

The notion of restricted de Bruijn graphs was introduced by Ruskey et al. [15] for the case of a binary alphabet.
In their paper, Ruskey et al. showed that $D(S)$ is Eulerian when $S = S(2, \ell; 1, [w - 1, w])$ for $w \in [\ell]$. Nevertheless,
the results of [15] can be extended to general $q$, $q^*$ and a more general range of weights. As these extensions are
needed for our subsequent derivation, we provide their technical proofs in Appendix A. For purposes of brevity,
we write $D(S(q, \ell; q^*, [w_1, w_2]))$ and $D([\![q]\!]^\ell)$ as $D(q, \ell; q^*, [w_1, w_2])$ and $D(q, \ell)$, respectively.

Proposition 3.1. Fix $q$ and $\ell$. Let $1 \le q^* \le q - 1$ and $1 \le w_1 < w_2 \le \ell$. Then $D(q, \ell; q^*, [w_1, w_2])$ is Eulerian. In
addition, $D(q, \ell)$ is Hamiltonian.

Fig. 2. The restricted de Bruijn graphs defined on the sets $[\![2]\!]^3$ and $S(2, 4; 1, [2, 3])$. We represent the respective 3- and 4-gram profile vectors
for 0001000 and 011001101011 using their respective restricted de Bruijn graphs.

Observe that when $q^* = q - 1$, $w_1 = 0$, $w_2 = \ell$, we recover the classical result that the de Bruijn graph $D(q, \ell)$
is Eulerian and Hamiltonian.

A. Enumerating $\mathcal{Q}(n; S)$

In this subsection, we provide the main enumeration results for $\mathcal{Q}(n; S)$, or equivalently, for $p\mathcal{Q}(n; S)$. We first
assume that $D(S)$ is strongly connected. In addition, we consider closed walks in $D(S)$, or equivalently, closed
words that start and end with the same $(\ell - 1)$-gram. We denote the set of closed words in $\mathcal{Q}(n; S)$ by $\bar{\mathcal{Q}}(n; S)$,
and the corresponding set of profile vectors by $p\bar{\mathcal{Q}}(n; S)$.

Suppose that $u$ belongs to $p\bar{\mathcal{Q}}(n; S)$. Then the following system of linear equations, which we refer to as the flow
conservation equations, holds:

$$B(D(S))\,u = 0. \tag{1}$$

Let $\mathbf{1}$ denote the all-ones vector. Since the number of $\ell$-grams in a word of length $n$ is $n - \ell + 1$, we also have

$$\mathbf{1}^T u = n - \ell + 1. \tag{2}$$

Let $A(S)$ be $B(D(S))$ augmented with a top row $\mathbf{1}^T$; let $b$ be a vector of length $|V(S)| + 1$ with a one as its
first entry, and zeros elsewhere. Equations (1) and (2) may then be rewritten as $A(S)u = (n - \ell + 1)b$.

Consider the following two sets of integer points

$$\mathcal{F}(n; S) = \{u \in \mathbb{Z}^{|S|} : A(S)u = (n - \ell + 1)b,\ u \ge 0\}, \tag{3}$$

$$\mathcal{E}(n; S) = \{u \in \mathbb{Z}^{|S|} : A(S)u = (n - \ell + 1)b,\ u > 0\}. \tag{4}$$

The preceding discussion asserts that the profile vector of any closed word must lie in $\mathcal{F}(n; S)$. Conversely, the
next lemma shows that any vector in $\mathcal{E}(n; S)$ is a profile vector of some word in $\bar{\mathcal{Q}}(n; S)$.
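Membership in the two sets can be tested without forming the matrix $A(S)$ explicitly, since each row of $B(D(S))u$ is the net flow (inflow minus outflow) at one node. The sketch below assumes that $u$ is given as a dictionary keyed by the $\ell$-grams of $S$, written as strings; all names are illustrative.

```python
def net_flow(u):
    """Net flow at every node of D(S) for arc multiplicities u."""
    flow = {}
    for z, count in u.items():
        initial, terminal = z[:-1], z[1:]
        flow[terminal] = flow.get(terminal, 0) + count
        flow[initial] = flow.get(initial, 0) - count  # loops cancel out
    return flow

def in_F(u, n, ell):
    """u >= 0, entries sum to n - ell + 1, and flow is conserved at every node."""
    return (all(c >= 0 for c in u.values())
            and sum(u.values()) == n - ell + 1
            and all(f == 0 for f in net_flow(u).values()))

def in_E(u, n, ell):
    """Same as in_F, but every entry must be strictly positive."""
    return in_F(u, n, ell) and all(c > 0 for c in u.values())

# S = {00, 01, 10}; the closed word 00100 has the profile below.
print(in_E({"00": 2, "01": 1, "10": 1}, 5, 2))  # True
print(in_F({"00": 3, "01": 0, "10": 0}, 4, 2))  # True, but not in E
print(in_F({"00": 2, "01": 1, "10": 0}, 4, 2))  # False: 0001 is not closed
```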
Lemma 3.2. Suppose that $D(S)$ is strongly connected. If $u \in \mathcal{E}(n; S)$, then there exists a word $x \in \bar{\mathcal{Q}}(n; S)$ such
that $p(x; S) = u$. That is, $\mathcal{E}(n; S) \subseteq p\bar{\mathcal{Q}}(n; S)$.

Proof: Construct a multidigraph $D'$ on the node set $V(S)$ by adding $u_z$ copies of the arc $z$ for all $z \in S$.
Since each $u_z$ is positive and $D(S)$ is strongly connected, $D'$ is also strongly connected. Since $u \in \mathcal{E}(n; S)$, $u$
also satisfies the flow conservation equations and $D'$ is consequently Eulerian. Also, as $D'$ has $n - \ell + 1$ arcs, an
Eulerian walk on $D'$ yields one such desired word $x$. ■
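The proof is constructive, and the Eulerian walk can be found with Hierholzer's algorithm. The following sketch is one possible realization rather than the procedure used in the paper; it assumes that $u$ is a dictionary of positive counts keyed by $\ell$-grams (as strings) that satisfies flow conservation, and that $D(S)$ is strongly connected.

```python
from collections import Counter

def gram_counts(word, ell):
    """Sparse ell-gram profile of a string."""
    return Counter(word[i:i + ell] for i in range(len(word) - ell + 1))

def word_from_profile(u):
    """Spell a closed word whose profile is u via an Eulerian circuit of D'."""
    # out[v] lists the terminal nodes of the (repeated) arcs leaving v.
    out = {}
    for z, count in u.items():
        out.setdefault(z[:-1], []).extend([z[1:]] * count)
    start = next(iter(out))
    stack, circuit = [start], []
    while stack:                      # iterative Hierholzer
        v = stack[-1]
        if out.get(v):
            stack.append(out[v].pop())
        else:
            circuit.append(stack.pop())
    circuit.reverse()
    # The first node contributes ell - 1 symbols, every later node one symbol.
    return circuit[0] + "".join(v[-1] for v in circuit[1:])

u = {"00": 2, "01": 1, "10": 1}
x = word_from_profile(u)
print(x, dict(gram_counts(x, 2)) == u)
```

The word returned depends on the order in which arcs are consumed, so it is only one representative of the equivalence class with profile $u$.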

Therefore, we have the following relation,

$$\mathcal{E}(n; S) \subseteq p\bar{\mathcal{Q}}(n; S) \subseteq \mathcal{F}(n; S). \tag{5}$$

We first state our main enumeration result and defer its proof to Section 4. Specifically, under the assumption
that $D(S)$ is strongly connected, we show that both $|\mathcal{E}(n; S)|$ and $|\mathcal{F}(n; S)|$ are quasipolynomials in $n$ whose
coefficients are periodic in $n$. Following Beck and Robins [16], we define a quasipolynomial $f$ as a function in $n$
of the form $c_D(n) n^D + c_{D-1}(n) n^{D-1} + \cdots + c_0(n)$, where $c_D, c_{D-1}, \ldots, c_0$ are periodic functions in $n$. If $c_D$ is not
identically equal to zero, $f$ is said to be of degree $D$. The period of $f$ is given by the least common multiple of
the periods of $c_D, c_{D-1}, \ldots, c_0$.

In order to state our asymptotic results, we adapt the standard $\Omega$ and $\Theta$ symbols. We use $f(n) = \Omega'(g(n))$ to state
that for a fixed value of $\ell$, there exists an integer $\lambda$ and a positive constant $c$ so that $f(n) \ge c g(n)$ for sufficiently
large $n$ with $\lambda \mid (n - \ell + 1)$. In other words, $f(n) \ge c g(n)$ whenever $n$ is sufficiently large and is congruent to $\ell - 1$
modulo $\lambda$. We write $f(n) = \Theta'(g(n))$ if $f(n) = O(g(n))$ and $f(n) = \Omega'(g(n))$.

Theorem 3.3. Suppose $D(S)$ is strongly connected and let $\lambda$ be the least common multiple of the lengths of all
cycles in $D(S)$. Then $|\mathcal{E}(n; S)|$ and $|\mathcal{F}(n; S)|$ are both quasipolynomials in $n$ of the same degree $|S| - |V(S)|$ and
share the same period that divides $\lambda$. In particular, $|p\bar{\mathcal{Q}}(n; S)| = \Theta'(n^{|S| - |V(S)|})$.

Before we end this section, we look at certain implications of Theorem 3.3. First, we show that the estimate on
$|p\bar{\mathcal{Q}}(n; S)|$ extends to $|p\mathcal{Q}(n; S)|$ when $D(S)$ is strongly connected.

Corollary 3.4. Suppose $D(S)$ is strongly connected. For any $z, z' \in V(S)$, consider the set of words in $\mathcal{Q}(n; S)$
that begin with $z$ and end with $z'$ and let $p\mathcal{Q}(n; S, z \to z')$ be the corresponding set of profile vectors. Similarly,
let $p\mathcal{Q}(n; S, z \to *)$ and $p\mathcal{Q}(n; S, * \to z')$ denote the set of profile vectors of words beginning with $z$ and words
ending with $z'$, respectively. Then

$$|p\mathcal{Q}(n; S)| = \Theta'(|p\mathcal{Q}(n; S, z \to z')|) = \Theta'(|p\mathcal{Q}(n; S, * \to z')|) = \Theta'(|p\mathcal{Q}(n; S, z \to *)|) = \Theta'\left(n^{|S| - |V(S)|}\right).$$

Proof: Let $z, z' \in V(S)$. Since $D(S)$ is strongly connected, we consider the shortest path from $z$ to $z'$ in $D(S)$.
Let $w = zw'$ be the corresponding $q$-ary word, $\rho(z, z') = |w'|$ be the length of the path and $u(z \to z') = p(w; S)$
be its profile vector. Observe that both the length $\rho(z, z')$ and the vector $u(z \to z')$ are independent of $n$.

We demonstrate the following inequality:

$$|\mathcal{E}(n - \rho(z, z'); S)| \le |p\mathcal{Q}(n; S, z \to z')| \le |p\bar{\mathcal{Q}}(n + \rho(z', z); S)|. \tag{6}$$

First, we construct a map $\phi_1 : \mathcal{E}(n - \rho(z, z'); S) \to p\mathcal{Q}(n; S, z \to z')$ defined by $u \mapsto u + u(z \to z')$. Now,
since $u \in \mathcal{E}(n - \rho(z, z'); S)$, we can assume that $u$ is the profile vector of a word $x$ of length $n - \rho(z, z')$ that
starts and ends with $z$. Then $xw'$ is a word of length $n$ whose profile vector lies in $p\mathcal{Q}(n; S, z \to z')$. Hence, $\phi_1$
is a well-defined map and it can be easily shown that the map is injective. Therefore, the first inequality holds.

Similarly, for the other inequality, we consider the map $\phi_2 : p\mathcal{Q}(n; S, z \to z') \to p\bar{\mathcal{Q}}(n + \rho(z', z); S)$ defined by
$u \mapsto u + u(z' \to z)$. As before, let $u$ be the profile vector of a word $x$ of length $n$ that starts with $z$ and ends with $z'$.
Let $w = z'w'$ be the $q$-ary word corresponding to the shortest path from $z'$ to $z$ in $D(S)$. Concatenating $x$ with
$w'$ yields $xw'$, which is a word of length $n + \rho(z', z)$. The profile vector of this word lies in $p\bar{\mathcal{Q}}(n + \rho(z', z); S)$.
Hence, $\phi_2$ is a well-defined map and is injective.

Combining (6) with the fact that $|\mathcal{E}(n; S)| = \Theta'(n^{|S| - |V(S)|})$ and $|p\bar{\mathcal{Q}}(n; S)| = \Theta'(n^{|S| - |V(S)|})$ yields the result

$$|p\mathcal{Q}(n; S, z \to z')| = \Theta'\left(n^{|S| - |V(S)|}\right).$$

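Next, we demonstrate that $|p\mathcal{Q}(n; S)| = \Theta'(n^{|S| - |V(S)|})$ and observe that the other asymptotic equalities may
be derived similarly. Let $P = \max\{\rho(z, z') : z, z' \in V(S)\}$ be the diameter of the digraph $D(S)$. Then,

$$|p\mathcal{Q}(n; S)| \le \sum_{z, z' \in V(S)} |p\mathcal{Q}(n; S, z \to z')| \le \sum_{z, z' \in V(S)} |p\bar{\mathcal{Q}}(n + \rho(z', z); S)| \le |V(S)|^2\, |p\bar{\mathcal{Q}}(n + P; S)| = O\left(n^{|S| - |V(S)|}\right).$$

Since $|p\mathcal{Q}(n; S)| \ge |p\bar{\mathcal{Q}}(n; S)| = \Theta'(n^{|S| - |V(S)|})$, the corollary follows. ■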

In the special case where $S = [\![q]\!]^\ell$, Jacquet et al. demonstrated a stronger version of Theorem 3.3 using analytic
combinatorics. In addition, using a careful analysis similar to the proof of Corollary 3.4, Jacquet et al. also provided
a tighter bound for $|p\mathcal{Q}(n; q, \ell)|$ for the case $\ell = 2$. Note that $f(n) \sim g(n)$ stands for $\lim_{n \to \infty} f(n)/g(n) = 1$.

Theorem 3.5 (Jacquet et al. [9]). Fix $q, \ell$. Let $\mathcal{E}(n; [\![q]\!]^\ell)$, $\mathcal{F}(n; [\![q]\!]^\ell)$, $p\mathcal{Q}(n; q, \ell)$ and $p\bar{\mathcal{Q}}(n; q, \ell)$ be defined as
above. Then

$$|\mathcal{E}(n; [\![q]\!]^\ell)| \sim |\mathcal{F}(n; [\![q]\!]^\ell)| \sim |p\bar{\mathcal{Q}}(n; q, \ell)| \sim c(q, \ell)\, n^{q^\ell - q^{\ell-1}}, \tag{7}$$

where $c(q, \ell)$ is a constant. Furthermore, when $\ell = 2$, we have $|p\mathcal{Q}(n; q, 2)| = (q^2 - q + 1)\,|p\bar{\mathcal{Q}}(n; q, 2)|\,(1 - O(n^{-2q}))$.

Next, we extend Theorem 3.3 to provide estimates on $|\mathcal{Q}(n; S)|$ and $|\bar{\mathcal{Q}}(n; S)|$ for general $S$, where $D(S)$ is not
necessarily strongly connected.

Given $D(S)$, let $V_1, V_2, \ldots, V_I$ be a partition of $V(S)$ such that the induced subgraph $(V_i, S_i)$ is strongly connected
for all $i \in [I]$. Define $\delta_i = |S_i| - |V_i|$. Then by Theorem 3.3 there are $\Theta'(n^{\delta_i})$ closed words belonging to $\bar{\mathcal{Q}}(n; S_i)$
and therefore, $\bar{\mathcal{Q}}(n; S)$. Suppose $\Delta = \max\{\delta_i : i \in [I]\}$. Then $|\bar{\mathcal{Q}}(n; S)| = \Omega'(n^\Delta)$.

On the other hand, any closed word $x$ in $\bar{\mathcal{Q}}(n; S)$ corresponds to a closed walk in $D(S)$ and a closed walk in
$D(S)$ must belong to some strongly connected component $(V_i, S_i)$. In other words, $x$ must belong to $\bar{\mathcal{Q}}(n; S_i)$ for
some $i \in [I]$. Hence, we have $|\bar{\mathcal{Q}}(n; S)| = O(n^\Delta)$.

Corollary 3.6. Given $D(S)$, let $V_1, V_2, \ldots, V_I$ be a partition of $V(S)$ such that the induced subgraph $(V_i, S_i)$ is
strongly connected for all $i \in [I]$. Define $\Delta = \max\{|S_i| - |V_i| : i \in [I]\}$. Then $|\bar{\mathcal{Q}}(n; S)| = \Theta'(n^\Delta)$.


Example 3.1. Let $q = 4$, $\ell = 2$ and $S = \{00, 01, 10, 12, 23, 32, 33\}$. Then $D(S)$ is as shown below.

We have two strongly connected components, namely, $V_1 = \{0, 1\}$ and $V_2 = \{2, 3\}$. So, $(V_1, S_1 = \{00, 01, 10\})$
and $(V_2, S_2 = \{23, 32, 33\})$ are both strongly connected digraphs with $|p\bar{\mathcal{Q}}(n; S_1)| = |p\bar{\mathcal{Q}}(n; S_2)| = \lfloor n/2 \rfloor + 1 =
\Theta'(n)$. Hence, $|p\bar{\mathcal{Q}}(n; S)| = |p\bar{\mathcal{Q}}(n; S_1)| + |p\bar{\mathcal{Q}}(n; S_2)| = \Theta'(n)$, in agreement with Corollary 3.6.

On the other hand, let us enumerate the elements of $\mathcal{Q}(n; S)$ or $p\mathcal{Q}(n; S)$. Let $u \in p\mathcal{Q}(n; S)$. If $u_{12} = 0$,
then $u$ belongs to $p\mathcal{Q}(n; S_1)$ or $p\mathcal{Q}(n; S_2)$. Otherwise, $u_{12} = 1$ and we have $u = u_1 + \chi(12) + u_2$ with $u_1 \in
p\mathcal{Q}(n_1; S_1, * \to 1)$, $u_2 \in p\mathcal{Q}(n_2; S_2, 2 \to *)$ and $n_1 + n_2 + 1 = n - \ell + 1$.
Fig. 3. Constructing a weighted digraph from the connected components of $D(S)$.

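Now, $|p\mathcal{Q}(n; S_1)| = |p\mathcal{Q}(n; S_2)| = n + \lceil n/2 \rceil + 1$ and $|p\mathcal{Q}(n; S_1, * \to 1)| = |p\mathcal{Q}(n; S_2, 2 \to *)| = n + 1$. By
setting $|p\mathcal{Q}(0; S_1, * \to 1)| = |p\mathcal{Q}(0; S_2, 2 \to *)| = 1$, we arrive at

$$\begin{aligned}
|p\mathcal{Q}(n; S)| &= |p\mathcal{Q}(n; S_1)| + |p\mathcal{Q}(n; S_2)| + \sum_{n_1 = 0}^{n-1} |p\mathcal{Q}(n_1; S_1, * \to 1)|\, |p\mathcal{Q}(n - n_1 - 1; S_2, 2 \to *)| \\
&= 2n + 2\lceil n/2 \rceil + 2 + \sum_{n_1 = 0}^{n-1} (n_1 + 1)(n - n_1) \\
&= 2n + 2\lceil n/2 \rceil + 2 + \frac{1}{6}\, n(n + 1)(n + 2) = \Theta'(n^3).
\end{aligned}$$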

Therefore, when $D(S)$ is not strongly connected, it is not necessarily true that $|p\bar{\mathcal{Q}}(n; S)|$ and $|p\mathcal{Q}(n; S)|$ differ
only by a constant factor. Furthermore, we can extend the methods in this example to obtain $|p\mathcal{Q}(n; S)|$ for digraphs
that are not necessarily strongly connected.
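For small word lengths, the closed-word counts in Example 3.1 can be checked by exhaustive search. The sketch below is a brute-force illustration with assumed names; it takes the word length $n$ as its argument and expresses the counts through the number $m = n - \ell + 1$ of 2-grams, in which the closed count for $S_1$ reads $\lfloor m/2 \rfloor + 1$. It does not attempt to reproduce the closed-form expressions for non-closed words over the full set $S$.

```python
from itertools import product

def profiles(S, n, closed_only=False):
    """Set of distinct profile vectors of words of length n with all ell-grams in S."""
    S = set(S)
    ell = len(next(iter(S)))
    alphabet = sorted({c for z in S for c in z})
    order = sorted(S)
    found = set()
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        grams = [word[i:i + ell] for i in range(n - ell + 1)]
        if any(g not in S for g in grams):
            continue
        if closed_only and word[:ell - 1] != word[-(ell - 1):]:
            continue
        found.add(tuple(grams.count(z) for z in order))
    return found

S1 = ["00", "01", "10"]
S = ["00", "01", "10", "12", "23", "32", "33"]
for n in range(2, 8):
    m = n - 1  # number of 2-grams in a word of length n
    print(n, len(profiles(S1, n, closed_only=True)), m // 2 + 1,
          len(profiles(S, n, closed_only=True)))
```

The last column is twice the count for $S_1$, since every closed word stays inside one of the two strongly connected components.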

To determine $|p\mathcal{Q}(n; S)|$, we construct an auxiliary weighted digraph with nodes $v_1, v_2, \ldots, v_I$, $v_{\rm source}$ and $v_{\rm sink}$.
If there exists an arc from the component $V_i$ to component $V_j$, $i, j \in [I]$, we add an arc from $v_i$ to $v_j$. Further, we
add an arc from $v_{\rm source}$ to $v_i$ and from $v_i$ to $v_{\rm sink}$ for all $i \in [I]$. The arcs leaving $v_{\rm source}$ have zero weight. For all
$i \in [I]$, the arcs leaving $v_i$ have weight $\delta_i = |S_i| - |V_i|$ if their terminal node is $v_{\rm sink}$, and weight $\delta_i + 1$ otherwise
(see Fig. 3 for the transformation).

Let $D'$ be the resulting digraph and observe that $D'$ is acyclic. Hence, we can find the longest weighted path
from $v_{\rm source}$ to $v_{\rm sink}$ in linear time (see Ahuja et al. [17, Ch. 4]). Furthermore, suppose that $\Delta$ is the weight of the
longest path. Then the next corollary states that $|p\mathcal{Q}(n; S)| = \Theta'(n^\Delta)$.
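The construction above can be sketched in a few lines for small instances. The code below is an illustrative implementation under stated assumptions: strongly connected components are found by plain reachability rather than a linear-time algorithm, the source and sink are handled implicitly, and a single node without a loop is treated as a component with $\delta_i = -1$, a degenerate case the text does not discuss. The function name `growth_exponent` is hypothetical.

```python
def growth_exponent(S):
    """Weight of the longest source-to-sink path in the auxiliary digraph D'."""
    arcs = [(z[:-1], z[1:]) for z in S]
    nodes = sorted({v for arc in arcs for v in arc})
    succ = {v: set() for v in nodes}
    for a, b in arcs:
        succ[a].add(b)

    def reachable(v):
        seen, stack = {v}, [v]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    reach = {v: reachable(v) for v in nodes}
    # Component of v: nodes that reach v and are reachable from v.
    comp = {v: frozenset(w for w in reach[v] if v in reach[w]) for v in nodes}
    comps = set(comp.values())
    delta = {c: sum(1 for a, b in arcs if a in c and b in c) - len(c) for c in comps}
    csucc = {c: {comp[b] for a, b in arcs if a in c and comp[b] != c} for c in comps}

    memo = {}
    def best(c):
        # Either go straight to the sink (weight delta) or move on (weight delta + 1).
        if c not in memo:
            memo[c] = max([delta[c]] + [delta[c] + 1 + best(d) for d in csucc[c]])
        return memo[c]

    return max(best(c) for c in comps)

print(growth_exponent(["00", "01", "10", "12", "23", "32", "33"]))  # 3
print(growth_exponent(["00", "01", "10"]))                          # 1
```

For the set $S$ of Example 3.1 the longest path visits both components and has weight $(1 + 1) + 1 = 3$, matching the cubic growth obtained there.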

Corollary 3.7. Given $D(S)$, let $V_1, V_2, \ldots, V_I$ be a partition of $V(S)$ such that the induced subgraph $(V_i, S_i)$ is
strongly connected for all $i \in [I]$. Construct $D'$ as above (see Fig. 3) and let $\Delta$ be the weight of the longest weighted
path from $v_{\rm source}$ to $v_{\rm sink}$. Then $|p\mathcal{Q}(n; S)| = \Theta'(n^\Delta)$.

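Proof: Let $u \in p\mathcal{Q}(n; S)$. Then there exists a set of indices $\{i_1, i_2, \ldots, i_t\} \subseteq [I]$, a set of vectors $u_1, u_2, \ldots, u_t$,
$e_1, e_2, \ldots, e_{t-1}$, and integers $n_1, n_2, \ldots, n_t$ such that the following hold:

• $u = u_1 + e_1 + u_2 + e_2 + \cdots + e_{t-1} + u_t$;

• for $j \in [t - 1]$, $e_j$ is the incidence vector of some arc $(z_j, z_{j+1})$ in $D(S)$ and $z_j \in V_{i_j}$;

• $u_1 \in p\mathcal{Q}(n_1; S_{i_1}, * \to z_1)$, $u_t \in p\mathcal{Q}(n_t; S_{i_t}, z_t \to *)$ and $u_j \in p\mathcal{Q}(n_j; S_{i_j}, z_j \to z_{j+1})$ for $2 \le j \le t - 1$;

• $(t - 1) + \sum_{j=1}^{t} n_j = n - \ell + 1$;

• $v_{\rm source} v_{i_1} v_{i_2} \cdots v_{i_t} v_{\rm sink}$ is a path in $D'$.

For a fixed subset $\{i_1, i_2, \ldots, i_t\} \subseteq [I]$, write $n' = (n - \ell + 1) - (t - 1)$. Observe that

$$\begin{aligned}
&\sum_{\sum_j n_j = n'} |p\mathcal{Q}(n_1; S_{i_1}, * \to z_1)|\, |p\mathcal{Q}(n_t; S_{i_t}, z_t \to *)| \prod_{j=2}^{t-1} |p\mathcal{Q}(n_j; S_{i_j}, z_j \to z_{j+1})| \\
&\qquad = \sum_{n_1 = 0}^{n'} \sum_{n_2 = 0}^{n' - n_1} \cdots \sum_{n_{t-1} = 0}^{n' - n_1 - \cdots - n_{t-2}} O\left(n^{\delta_{i_1} + \delta_{i_2} + \cdots + \delta_{i_t}}\right) \\
&\qquad = O\left(n^{\delta_{i_1} + \delta_{i_2} + \cdots + \delta_{i_t} + (t - 1)}\right) = O(n^\Delta).
\end{aligned}$$

The last equality follows from the fact that $(t - 1) + \sum_{j=1}^{t} \delta_{i_j}$ measures the weight of $v_{\rm source} v_{i_1} v_{i_2} \cdots v_{i_t} v_{\rm sink}$ and this
value is upper bounded by $\Delta$. Since the number of subsets of $[I]$ is independent of $n$, we have $|p\mathcal{Q}(n; S)| = O(n^\Delta)$.

Conversely, suppose $v_{\rm source} v_{i_1} v_{i_2} \cdots v_{i_t} v_{\rm sink}$ is a path in $D'$ of maximum weight $\Delta$. Define $z_j$, $S_{i_j}$, $n_j$ and $n'$
as before. We then have

$$\begin{aligned}
|p\mathcal{Q}(n; S)| &\ge \sum_{\sum_j n_j = n'} |p\mathcal{Q}(n_1; S_{i_1}, * \to z_1)|\, |p\mathcal{Q}(n_t; S_{i_t}, z_t \to *)| \prod_{j=2}^{t-1} |p\mathcal{Q}(n_j; S_{i_j}, z_j \to z_{j+1})| \\
&\ge \sum_{\sum_j n_j = n'} C_1\, n_1^{\delta_{i_1}} n_2^{\delta_{i_2}} \cdots n_t^{\delta_{i_t}} \\
&\ge \sum_{\sum_j n_j = n'} C_2\, n^{\delta_{i_1} + \delta_{i_2} + \cdots + \delta_{i_t}} \quad \text{(by Jensen's inequality)} \\
&\ge C_3\, n^{\delta_{i_1} + \delta_{i_2} + \cdots + \delta_{i_t} + (t - 1)} = C_3\, n^\Delta,
\end{aligned}$$

where $C_1$, $C_2$ and $C_3$ are positive constants. Therefore, $|p\mathcal{Q}(n; S)| = \Omega'(n^\Delta)$, completing the proof. ■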

4. Ehrhart Theory and Proof of Theorem 3.3

We assume $D(S)$ to be strongly connected and provide a detailed proof of Theorem 3.3. For this purpose, in the
next subsection, we introduce some fundamental results from Ehrhart theory. Ehrhart theory is a natural framework
for enumerating profile vectors and one may simplify the techniques of [9] significantly and obtain similar results for
a more general family of digraphs. Furthermore, Ehrhart theory also allows us to extend the enumeration procedure
to profiles at a prescribed distance.

A. Ehrhart Theory 

As hinted by © and ©, to enumerate codewords of interest, we need to enumerate certain sets of integer points 
or lattice points in polytopes. The first general treatment of the theory of enumerating lattice points in polytopes 
was described by Ehrhart [18], and later developed by Stanley from a commutative-algebraic point of view (see [19, Ch. 4]). Here, we follow the combinatorial treatment of Beck and Robins [16].

Consider any rational polytope $\mathcal{P}$ given by

$$\mathcal{P} = \{\mathbf{u} \in \mathbb{R}^n : \mathbf{A}\mathbf{u} \le \mathbf{b}\},$$

for some integer matrix $\mathbf{A}$ and some integer vector $\mathbf{b}$. A rational polytope is integer if all its vertices are integral. The lattice point enumerator $L_{\mathcal{P}}(t)$ of $\mathcal{P}$ is given by

$$L_{\mathcal{P}}(t) = |\mathbb{Z}^n \cap t\mathcal{P}|, \quad \text{for all } t \in \mathbb{Z}_{>0}.$$

Ehrhart [18] introduced the lattice point enumerator for rational polytopes and showed that $L_{\mathcal{P}}(t)$ is a quasipolynomial of degree $D$, where $D$ is given by the dimension of the polytope $\mathcal{P}$. Here, we define the dimension of a polytope to be the dimension of the affine space spanned by points in $\mathcal{P}$. A formal statement of Ehrhart's theorem is provided below.

Theorem 4.1 (Ehrhart's theorem for polytopes [16, Thm 3.8 and 3.23]). If $\mathcal{P}$ is a rational convex polytope of dimension $D$, then $L_{\mathcal{P}}(t)$ is a quasipolynomial of degree $D$. Its period divides the least common multiple of the denominators of the coordinates of the vertices of $\mathcal{P}$. Furthermore, if $\mathcal{P}$ is integer, then $L_{\mathcal{P}}(t)$ is a polynomial of degree $D$.

Motivated by ©, we consider the relative interior of V. For the case where V is convex, the relative interior, 
or interior, is given by 

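$$\mathcal{P}^{\circ} = \{\mathbf{u} \in \mathcal{P} : \text{for all } \mathbf{u}' \in \mathcal{P}, \text{ there exists an } \epsilon > 0 \text{ such that } \mathbf{u} + \epsilon(\mathbf{u} - \mathbf{u}') \in \mathcal{P}\}.$$

For a positive integer $t$, we consider the quantity

$$L_{\mathcal{P}^{\circ}}(t) = |\mathbb{Z}^n \cap t\mathcal{P}^{\circ}|.$$

Ehrhart conjectured the following relation between $L_{\mathcal{P}}(t)$ and $L_{\mathcal{P}^{\circ}}(t)$, proved by Macdonald [20].

Theorem 4.2 (Ehrhart-Macdonald reciprocity [16, Thm 4.1]). If $\mathcal{P}$ is a rational convex polytope of dimension $D$, then the evaluation of $L_{\mathcal{P}}(t)$ at negative integers satisfies

$$L_{\mathcal{P}}(-t) = (-1)^D L_{\mathcal{P}^{\circ}}(t).$$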
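
As a small sanity check of Theorems 4.1 and 4.2, the following sketch counts lattice points by brute force in dilations of the unit square $[0,1]^2$, a toy polytope chosen here purely for illustration (it does not appear in the paper). Its Ehrhart polynomial is the well-known $(t+1)^2$, and the function names are hypothetical.

```python
from itertools import product


def lattice_count(t, interior=False):
    """Count integer points in t*[0,1]^2, or in its interior if requested."""
    lo, hi = (1, t - 1) if interior else (0, t)
    return sum(1 for _ in product(range(lo, hi + 1), repeat=2))


def ehrhart_poly(t):
    """Ehrhart polynomial of the unit square: L_P(t) = (t + 1)^2."""
    return (t + 1) ** 2


D = 2  # dimension of the square
for t in range(1, 6):
    # Theorem 4.1: the enumerator of an integer polytope is a polynomial.
    assert lattice_count(t) == ehrhart_poly(t)
    # Theorem 4.2: L_P(-t) = (-1)^D * L_{P interior}(t).
    assert lattice_count(t, interior=True) == (-1) ** D * ehrhart_poly(-t)
```

The polytopes $\mathcal{P}(S)$ used in this section are rational rather than integer, so their enumerators are in general only quasipolynomials; the toy example illustrates the integer case only.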


B. Proof of Theorem 3.3

Recall the definitions of $\mathbf{A}(S)$ and $\mathbf{b}$ from Section 3, and consider the polytope

$$\mathcal{P}(S) = \{\mathbf{u} \in \mathbb{R}^{|S|} : \mathbf{A}(S)\mathbf{u} = \mathbf{b},\ \mathbf{u} \ge \mathbf{0}\}. \tag{8}$$

Using lattice point enumerators, we may write $|\mathcal{F}(n;S)| = L_{\mathcal{P}(S)}(n - \ell + 1)$. Therefore, in view of Ehrhart's theorem, we need to determine the dimension of the polytope $\mathcal{P}(S)$ and characterize the interior and the vertices of this polytope.

Lemma 4.3. Suppose that $D(S)$ is strongly connected. Then the dimension of $\mathcal{P}(S)$ is $|S| - |V(S)|$.

Proof: We first establish that the rank of $\mathbf{A}(S)$ is $|V(S)|$. Since $D(S)$ is connected, the rank of $\mathbf{B}(D(S))$ is $|V(S)| - 1$. We next show that $\mathbf{1}^T$ does not belong to the row space of $\mathbf{B}(D(S))$. As $D(S)$ is strongly connected, $D(S)$ contains a cycle, say $C$. Since $\mathbf{B}(D(S))\chi(C) = \mathbf{0}$ but $\mathbf{1}^T\chi(C) = |C| \ne 0$, $\mathbf{1}^T$ does not belong to the row space of $\mathbf{B}(D(S))$, so augmenting the matrix with the all-one row increases its rank by one. Therefore, the nullity of $\mathbf{A}(S)$ is $|S| - |V(S)|$.

Next, we show that there exists a $\mathbf{u} > \mathbf{0}$ such that $\mathbf{A}(S)\mathbf{u} = \mathbf{b}$. Since the nullity of $\mathbf{B}(D(S))$ is positive, there exists a $\mathbf{u}'$ such that $\mathbf{A}(S)\mathbf{u}' = \mathbf{b}$. Since $D(S)$ is strongly connected, there exists a closed walk on $D(S)$ that visits all arcs at least once. In other words, there exists a vector $\mathbf{v} > \mathbf{0}$ such that $\mathbf{A}(S)\mathbf{v} = \mu\mathbf{b}$ for some $\mu > 0$. Choose $\mu'$ sufficiently large so that $\mathbf{u}' + \mu'\mathbf{v} > \mathbf{0}$ and set $\mathbf{u} = (\mathbf{u}' + \mu'\mathbf{v})/(1 + \mu'\mu)$. One can easily verify that $\mathbf{A}(S)\mathbf{u} = \mathbf{b}$.

To complete the proof, we exhibit a set of $|S| - |V(S)| + 1$ affinely independent points in $\mathcal{P}(S)$. Let $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{|S|-|V(S)|}$ be linearly independent vectors that span the null space of $\mathbf{A}(S)$. Since $\mathbf{u}$ has strictly positive entries, we can find $\epsilon$ small enough so that $\mathbf{u} + \epsilon\mathbf{u}_i$ belongs to $\mathcal{P}(S)$ for all $i \in [|S| - |V(S)|]$. Therefore, $\{\mathbf{u}, \mathbf{u} + \epsilon\mathbf{u}_1, \mathbf{u} + \epsilon\mathbf{u}_2, \ldots, \mathbf{u} + \epsilon\mathbf{u}_{|S|-|V(S)|}\}$ is the desired set of $|S| - |V(S)| + 1$ affinely independent points in $\mathcal{P}(S)$. ■
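
The rank computation in Lemma 4.3 can be checked numerically for small unrestricted de Bruijn digraphs. The sketch below is illustrative only: it assumes $S = [\![q]\!]^\ell$, builds the incidence matrix with the sign convention "minus one at the prefix node, plus one at the suffix node" (the paper's own convention for $\mathbf{B}$ is not visible in this section, and the rank does not depend on it), stacks the all-one row on top, and computes the rank exactly with rational arithmetic. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def de_bruijn_system(q, ell):
    """Return (A, arcs, nodes) for the full de Bruijn digraph on (ell-1)-grams."""
    symbols = [str(a) for a in range(q)]
    nodes = [''.join(w) for w in product(symbols, repeat=ell - 1)]
    arcs = [''.join(w) for w in product(symbols, repeat=ell)]
    index = {v: i for i, v in enumerate(nodes)}
    B = [[0] * len(arcs) for _ in nodes]
    for j, z in enumerate(arcs):
        B[index[z[:-1]]][j] -= 1  # arc z leaves its (ell-1)-prefix
        B[index[z[1:]]][j] += 1   # and enters its (ell-1)-suffix; loops cancel to 0
    A = [[1] * len(arcs)] + B     # all-one row stacked on the incidence matrix
    return A, arcs, nodes


def rank(matrix):
    """Exact rank via Gauss-Jordan elimination over the rationals."""
    rows = [[Fraction(x) for x in r] for r in matrix]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                f = rows[i][col] / rows[rk][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


A, arcs, nodes = de_bruijn_system(2, 3)
dimension = len(arcs) - rank(A)  # nullity of A(S) = |S| - |V(S)|
```

For $q = 2$, $\ell = 3$ the rank equals $|V(S)| = 4$ and the nullity equals $|S| - |V(S)| = 4$, in agreement with the lemma.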

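Lemma 4.4. Suppose $D(S)$ is strongly connected. Then $\mathcal{P}^{\circ}(S) = \{\mathbf{u} \in \mathbb{R}^{|S|} : \mathbf{A}(S)\mathbf{u} = \mathbf{b},\ \mathbf{u} > \mathbf{0}\}$. Therefore,

$$|\mathcal{E}(n;S)| = L_{\mathcal{P}^{\circ}(S)}(n - \ell + 1).$$

Proof: Let $\mathbf{u} > \mathbf{0}$ be such that $\mathbf{A}(S)\mathbf{u} = \mathbf{b}$. For any $\mathbf{u}' \in \mathcal{P}(S)$, we have $\mathbf{A}(S)\mathbf{u}' = \mathbf{b}$ and hence, $\mathbf{A}(S)(\mathbf{u} - \mathbf{u}') = \mathbf{0}$. Since $\mathbf{u}$ has strictly positive entries, we choose $\epsilon$ small enough so that $\mathbf{u} + \epsilon(\mathbf{u} - \mathbf{u}') \ge \mathbf{0}$. Therefore, $\mathbf{u} + \epsilon(\mathbf{u} - \mathbf{u}')$ belongs to $\mathcal{P}(S)$ and $\mathbf{u}$ belongs to the interior of $\mathcal{P}(S)$.

Conversely, let $\mathbf{u} \in \mathcal{P}(S)$, with $u_z = 0$ for some $z \in S$. Since $D(S)$ is strongly connected, from the proof of Lemma 4.3, there exists a $\mathbf{u}' \in \mathcal{P}(S)$ with $\mathbf{u}' > \mathbf{0}$. Hence, for all $\epsilon > 0$, the $z$-coordinate of $\mathbf{u} + \epsilon(\mathbf{u} - \mathbf{u}')$ is given by $-\epsilon u'_z$, which is always negative. In other words, $\mathbf{u}$ does not belong to $\mathcal{P}^{\circ}(S)$. ■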

Therefore, using Ehrhart's theorem and Ehrhart-Macdonald reciprocity along with Lemmas 4.3 and 4.4, we arrive at the fact that $|\mathcal{E}(n;S)|$ and $|\mathcal{F}(n;S)|$ are quasipolynomials in $n$ whose coefficients are periodic in $n$.

In order to determine the period of the quasipolynomials, we characterize the vertex set of $\mathcal{P}(S)$. A point $\mathbf{v}$ in a polytope is a vertex if $\mathbf{v}$ cannot be expressed as a convex combination of the other points.

Lemma 4.5. The vertex set of $\mathcal{P}(S)$ is given by $\{\chi(C)/|C| : C \text{ is a cycle in } D(S)\}$.

Proof: First, observe that $\chi(C)/|C|$ belongs to $\mathcal{P}(S)$ for any cycle $C$ in $D(S)$.

Let $\mathbf{v} \in \mathcal{P}(S)$ and suppose $\mathbf{v}$ is a vertex. Since $\mathbf{A}(S)$ has integer entries, $\mathbf{v}$ is rational. Choose $\mu > 0$ so that $\mu\mathbf{v}$ has integer entries. Construct the multigraph $D'$ on $V(S)$ by adding $\mu v_z$ copies of the arc $z$ for all $z \in S$. Since $\mathbf{v} \in \mathcal{P}(S)$, $\mathbf{B}(D(S))\mu\mathbf{v} = \mathbf{0}$ and hence, each of the connected components of $D'$ is Eulerian. Therefore, the arc set of $D'$ can be decomposed into disjoint cycles. Since $\mathbf{v}$ is a vertex, there can only be one cycle and hence, $\mathbf{v} = \chi(C)/|C|$ for some cycle $C$.

Conversely, we show that for any cycle $C$ in $D(S)$, $\chi(C)/|C|$ cannot be expressed as a convex combination of other points in $\mathcal{P}(S)$. Suppose otherwise. Then there exist cycles $C_1, C_2, \ldots, C_t$ distinct from $C$ and nonnegative scalars $\alpha_1, \alpha_2, \ldots, \alpha_t$ such that $\chi(C) = \sum_{i=1}^{t} \alpha_i \chi(C_i)$. For each $j$, let $e_j$ be an arc that belongs to $C_j$ but not $C$. Then

$$0 = \chi(C)_{e_j} = \sum_{1 \le i \le t} \alpha_i \chi(C_i)_{e_j} \ge \alpha_j \chi(C_j)_{e_j} = \alpha_j.$$

Hence, $\alpha_j = 0$ for all $j$. Therefore, $\chi(C) = \mathbf{0}$, a contradiction. ■
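
Since the vertices of $\mathcal{P}(S)$ are of the form $\chi(C)/|C|$, the denominators that matter for the period are the cycle lengths of $D(S)$. A minimal sketch, assuming the unrestricted case $S = [\![q]\!]^\ell$ and using a naive depth-first enumeration of simple cycles (adequate only for very small digraphs; the function names are hypothetical), is:

```python
from functools import reduce
from itertools import product
from math import gcd


def cycle_lengths(q, ell):
    """Set of lengths of simple cycles in the de Bruijn digraph on (ell-1)-grams."""
    symbols = [str(a) for a in range(q)]
    nodes = [''.join(w) for w in product(symbols, repeat=ell - 1)]
    succ = {v: [v[1:] + a for a in symbols] for v in nodes}
    lengths = set()

    def extend(start, current, visited):
        for nxt in succ[current]:
            if nxt == start:
                lengths.add(len(visited))          # closed a simple cycle
            elif nxt > start and nxt not in visited:
                # only visit nodes larger than the start, so each cycle
                # is discovered from its smallest node exactly once
                extend(start, nxt, visited | {nxt})

    for s in nodes:
        extend(s, s, {s})
    return lengths


def lcm_of(values):
    return reduce(lambda a, b: a * b // gcd(a, b), values, 1)


lengths = cycle_lengths(2, 3)
period_bound = lcm_of(lengths)
```

For $q = 2$ and $\ell = 3$ the digraph has four nodes, the cycle lengths are $\{1, 2, 3, 4\}$, and their least common multiple is 12.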

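Let $\lambda_S = \mathrm{lcm}\{|C| : C \text{ is a cycle in } D(S)\}$, where lcm denotes the lowest common multiple. Then the period of the quasipolynomial $L_{\mathcal{P}(S)}(n - \ell + 1)$ divides $\lambda_S$ by Ehrhart's theorem.

Let us dilate the polytope $\mathcal{P}(S)$ by $\lambda_S$ and consider the polytope $\lambda_S\mathcal{P}(S)$ and $L_{\lambda_S\mathcal{P}(S)}(t)$. Since $\lambda_S\mathcal{P}(S)$ is integer, both $L_{\lambda_S\mathcal{P}(S)}(t)$ and $L_{\lambda_S\mathcal{P}^{\circ}(S)}(t)$ are polynomials of degree $|S| - |V(S)|$. Hence,

$$|\mathcal{Q}(n;S)| \ge L_{\lambda_S\mathcal{P}^{\circ}(S)}(t) = \Omega\!\left(n^{|S|-|V(S)|}\right), \quad \text{whenever } n - \ell + 1 = \lambda_S t \text{ or } \lambda_S \mid (n - \ell + 1),$$

and therefore, $|\mathcal{Q}(n;S)| = \Theta'\!\left(n^{|S|-|V(S)|}\right)$. This completes the proof of Theorem 3.3.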

In the special case where $D(S)$ contains a loop, we can show further that the leading coefficients of the quasipolynomials $|\mathcal{E}(n; [\![q]\!]^\ell)|$ and $|\mathcal{F}(n; [\![q]\!]^\ell)|$ are the same and constant. This result is a direct consequence of Ehrhart-Macdonald reciprocity and the fact that $|\mathcal{E}(n; [\![q]\!]^\ell)|$ is monotonically increasing. We demonstrate the latter claim in Appendix B.

Note that when $S = [\![q]\!]^\ell$, Corollary 4.6 recovers a result of Jacquet et al. [9].

Corollary 4.6. Suppose $D(S)$ is strongly connected. If $D(S)$ contains a loop, then

$$|\mathcal{E}(n;S)| \sim |\mathcal{Q}(n;S)| \sim |\mathcal{F}(n;S)| \sim c(S)\,n^{|S|-|V(S)|} + O\!\left(n^{|S|-|V(S)|-1}\right), \quad \text{for some constant } c(S). \tag{9}$$

5. Constructive Lower Bounds 

Fix $S \subseteq [\![q]\!]^\ell$ and recall that $p\mathcal{Q}(n; S)$ denotes the set of all $\ell$-gram profile vectors of words in $\mathcal{Q}(n; S)$. For ease of exposition, we henceforth identify words in $\mathcal{Q}(n; S)$ with their corresponding profile vectors in $p\mathcal{Q}(n; S)$. In Section 7, we provide an efficient method to map a profile vector in $p\mathcal{Q}(n; S)$ back to a $q$-ary codeword in $\mathcal{Q}(n; S)$. Therefore, in this section, we construct GRCs as sets of profile vectors in $p\mathcal{Q}(n; S)$, which we may map back to corresponding $q$-ary codewords in $\mathcal{Q}(n; S)$.

Suppose that $\mathcal{C}$ is an $(N, d)$-AECC. We construct GRCs from $\mathcal{C}$ via the following methods:

(i) When $N = |S|$, we intersect $\mathcal{C}$ with $p\mathcal{Q}(n; S)$ to obtain an $\ell$-gram reconstruction code. In other words, we pick out the codewords in $\mathcal{C}$ that are also profile vectors. Specifically, $\mathcal{C} \cap p\mathcal{Q}(n; S)$ is an $(n, d; S)$-GRC. However, the size $|\mathcal{C} \cap p\mathcal{Q}(n; S)|$ is usually smaller than $|\mathcal{C}|$ and so, we provide estimates of $|\mathcal{C} \cap p\mathcal{Q}(n; S)|$ for a classical family of AECCs in Section 5-A.

(ii) When $N < |S|$, we extend each codeword in $\mathcal{C}$ to a profile vector of length $|S|$ in $p\mathcal{Q}(n; q, \ell)$. In contrast to the previous construction, we may in principle obtain an $(n, d; q, \ell)$-GRC with the same cardinality as $\mathcal{C}$. However, one may not always be able to extend an arbitrary word to a profile vector. Section 5-B describes one method of mapping words in $[\![m]\!]^N$ to $p\mathcal{Q}(n; q, \ell)$ that preserves the code size for a suitable choice of the parameters $m$ and $N$.

A. Intersection with p Q(n;S) 

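In this section, we estimate $|\mathcal{C} \cap p\mathcal{Q}(n; S)|$ when $\mathcal{C}$ belongs to a classical family of AECCs proposed by Varshamov [21]. Fix $d$ and let $p$ be a prime such that $p > d$ and $p > N$. Choose $N$ distinct nonzero elements $\alpha_1, \alpha_2, \ldots, \alpha_N$ in $\mathbb{Z}/p\mathbb{Z}$ and consider the matrix

$$\mathbf{H} = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_N \\ \alpha_1^2 & \alpha_2^2 & \cdots & \alpha_N^2 \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_1^d & \alpha_2^d & \cdots & \alpha_N^d \end{pmatrix}.$$

Pick any vector $\boldsymbol{\beta} \in (\mathbb{Z}/p\mathbb{Z})^d$ and define the code

$$\mathcal{C}(\mathbf{H}, \boldsymbol{\beta}) = \{\mathbf{u} : \mathbf{H}\mathbf{u} \equiv \boldsymbol{\beta} \bmod p\}.$$

Then, $\mathcal{C}(\mathbf{H}, \boldsymbol{\beta})$ is an $(N, d+1)$-AECC [21]. Hence, $\mathcal{C}(\mathbf{H}, \boldsymbol{\beta}) \cap p\mathcal{Q}(n; S)$ is an $(n, d+1; S)$-GRC for all $\boldsymbol{\beta} \in (\mathbb{Z}/p\mathbb{Z})^d$. Therefore, by the pigeonhole principle, there exists a $\boldsymbol{\beta}$ such that $|\mathcal{C}(\mathbf{H}, \boldsymbol{\beta}) \cap p\mathcal{Q}(n; S)|$ is at least $|p\mathcal{Q}(n; S)|/p^d$. However, the choice of $\boldsymbol{\beta}$ that guarantees this lower bound is not known.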
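
The pigeonhole step can be illustrated on a toy instance. In the sketch below the set `vectors` is only a stand-in for $p\mathcal{Q}(n; S)$ (it is simply a small box of integer vectors, not a set of profile vectors), and the parameters $p = 5$, $d = 2$, $\alpha = (1, 2, 3)$ are hypothetical values chosen to satisfy $p > d$ and $p > N$.

```python
from collections import Counter
from itertools import product


def syndrome(u, alphas, d, p):
    """Compute H u mod p, where row k of H is (alpha_1^k, ..., alpha_N^k), k = 1..d."""
    return tuple(
        sum(pow(a, k, p) * x for a, x in zip(alphas, u)) % p
        for k in range(1, d + 1)
    )


p, d, alphas = 5, 2, (1, 2, 3)
vectors = list(product(range(4), repeat=len(alphas)))  # stand-in for the profile set

# Partition the vectors by syndrome beta; each class is C(H, beta) restricted to the set.
classes = Counter(syndrome(u, alphas, d, p) for u in vectors)
best_beta, best_size = classes.most_common(1)[0]

# Pigeonhole: at most p^d classes, so the largest holds at least |set| / p^d vectors.
assert best_size * p ** d >= len(vectors)
```

The largest syndrome class is found here by exhaustive search, which is exactly what is unavailable in general: the argument guarantees that a good $\boldsymbol{\beta}$ exists without identifying it.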

In the rest of this section, we fix a certain choice of $\mathbf{H}$ and $\boldsymbol{\beta}$ and provide lower bounds on the size of $\mathcal{C}(\mathbf{H}, \boldsymbol{\beta}) \cap p\mathcal{Q}(n; S)$ as a function of $n$. As before, instead of looking at $p\mathcal{Q}(n; S)$ directly, we consider the set of closed words $\bar{\mathcal{Q}}(n; S)$ and the corresponding set of profile vectors $p\bar{\mathcal{Q}}(n; S)$.

Let $\boldsymbol{\beta} = \mathbf{0}$ and choose $\mathbf{H}$ and $p$ based on the restricted de Bruijn digraph $D(S)$. For an arbitrary matrix $\mathbf{M}$, let $\mathrm{Null}_{>0}\mathbf{M}$ denote the set of vectors in the null space of $\mathbf{M}$ that have positive entries. We assume $D(S)$ to be strongly connected so that $\mathrm{Null}_{>0}\mathbf{B}(D(S))$ is nonempty. Hence, we choose $\mathbf{H}$ and $p$ such that $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathrm{Null}_{>0}\mathbf{B}(D(S))$ is nonempty.
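Define the $(|V(S)| + 1 + d) \times (|S| + d)$-matrix

$$\mathbf{A}(\mathbf{H}, S) \triangleq \begin{pmatrix} \mathbf{A}(S) & \mathbf{0} \\ \mathbf{H} & -p\mathbf{I}_d \end{pmatrix},$$

where $\mathbf{A}(S)$ is as described in Section 3. Let $\mathbf{b}$ be a vector of length $|V(S)| + 1 + d$ that has 1 as the first entry and zeros elsewhere, and define the polytope

$$\mathcal{P}_{\mathrm{GRC}}(\mathbf{H}, S) \triangleq \{\mathbf{u} \in \mathbb{R}^{|S|+d} : \mathbf{A}(\mathbf{H}, S)\mathbf{u} = \mathbf{b},\ \mathbf{u} \ge \mathbf{0}\}. \tag{10}$$

Since $\mathcal{E}(n;S) \subseteq p\bar{\mathcal{Q}}(n;S) \subseteq p\mathcal{Q}(n;S)$, $|\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathcal{E}(n;S)|$ is a lower bound for $|\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap p\mathcal{Q}(n;S)|$. The following proposition demonstrates that $|\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathcal{E}(n; S)|$ is given by the number of lattice points in the interior of a dilation of $\mathcal{P}_{\mathrm{GRC}}(\mathbf{H}, S)$.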


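Proposition 5.1. Let $\mathcal{C}(\mathbf{H}, \mathbf{0})$ and $\mathcal{P}_{\mathrm{GRC}}(\mathbf{H}, S)$ be defined as above. If $D(S)$ is strongly connected and $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathrm{Null}_{>0}\mathbf{B}(D(S))$ is nonempty, then $|\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathcal{E}(n; S)| = |\mathbb{Z}^{|S|+d} \cap (n - \ell + 1)\mathcal{P}^{\circ}_{\mathrm{GRC}}(\mathbf{H}, S)|$.

Proof: Similar to Lemma 4.4, we have that $\mathcal{P}^{\circ}_{\mathrm{GRC}}(\mathbf{H}, S) = \{\mathbf{u} \in \mathbb{R}^{|S|+d} : \mathbf{A}(\mathbf{H}, S)\mathbf{u} = \mathbf{b},\ \mathbf{u} > \mathbf{0}\}$, and we defer the proof of this claim to Appendix C.

Let $\mathbf{u} > \mathbf{0}$ be such that $\mathbf{A}(\mathbf{H}, S)\mathbf{u} = (n - \ell + 1)\mathbf{b}$. Consider the vector $\mathbf{u}_0$ which equals the vector $\mathbf{u}$ restricted to the first $N$ coordinates. Then $\mathbf{A}(S)\mathbf{u}_0 = (n - \ell + 1)\mathbf{b}_0$, where $\mathbf{b}_0$ is a vector of length $|V(S)| + 1$ with one in its first coordinate and zeros elsewhere. Hence, $\mathbf{u}_0 \in \mathcal{E}(n; S)$. On the other hand, $\mathbf{H}\mathbf{u}_0 = p\boldsymbol{\beta}'$, where $\boldsymbol{\beta}'$ consists of the last $d$ entries of $\mathbf{u}$. In other words, $\mathbf{H}\mathbf{u}_0 \equiv \mathbf{0} \bmod p$ and so $\mathbf{u}_0 \in \mathcal{C}(\mathbf{H}, \mathbf{0})$.

Therefore, $\mathbf{u} \mapsto \mathbf{u}_0$ is a map from $\{\mathbf{u} : \mathbf{A}(\mathbf{H}, S)\mathbf{u} = (n - \ell + 1)\mathbf{b} \text{ and } \mathbf{u} > \mathbf{0}\}$ to $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathcal{E}(n; q, \ell)$. It can be verified that this map is a bijection. This proves the claimed result. ■

As before, we compute the dimension of $\mathcal{P}_{\mathrm{GRC}}(\mathbf{H}, S)$ and characterize its vertex set. Since the proofs are similar to the ones in Section 4, the reader is referred to the appendix for a detailed analysis.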


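Lemma 5.2. Let $\mathcal{C}(\mathbf{H}, \mathbf{0})$ and $\mathcal{P}_{\mathrm{GRC}}(\mathbf{H}, S)$ be defined as above. Suppose further that $D(S)$ is strongly connected and $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathrm{Null}_{>0}\mathbf{B}(D(S))$ is nonempty. The dimension of $\mathcal{P}_{\mathrm{GRC}}(\mathbf{H}, S)$ is $|S| - |V(S)|$, while its vertex set is given by

$$\left\{ \left( \frac{\chi(C)}{|C|},\ \frac{\mathbf{H}\chi(C)}{p|C|} \right) : C \text{ is a cycle in } D(S) \right\}.$$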

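Let $\lambda_{\mathrm{GRC}} = \mathrm{lcm}(\{|C| : C \text{ is a cycle in } D(S)\} \cup \{p\})$. Then Lemma 5.2, Ehrhart's theorem and Ehrhart-Macdonald reciprocity imply that $L_{\mathcal{P}^{\circ}_{\mathrm{GRC}}(\mathbf{H}, S)}(t)$ is a quasipolynomial of degree $|S| - |V(S)|$ whose period divides $\lambda_{\mathrm{GRC}}$. As in Section 4, we dilate the polytope $\mathcal{P}_{\mathrm{GRC}}(\mathbf{H}, S)$ by $\lambda_{\mathrm{GRC}}$ to obtain an integer polytope and assume that the polynomial $L_{\lambda_{\mathrm{GRC}}\mathcal{P}^{\circ}_{\mathrm{GRC}}(\mathbf{H}, S)}(t)$ has leading coefficient $c$. Hence, whenever $n - \ell + 1 = \lambda_{\mathrm{GRC}}t$, that is, whenever $\lambda_{\mathrm{GRC}} \mid (n - \ell + 1)$,

$$|\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathcal{E}(n; S)| = L_{\lambda_{\mathrm{GRC}}\mathcal{P}^{\circ}_{\mathrm{GRC}}(\mathbf{H}, S)}(t) = c\,t^{|S|-|V(S)|} + O\!\left(t^{|S|-|V(S)|-1}\right)$$

$$= c\,(n/\lambda_{\mathrm{GRC}})^{|S|-|V(S)|} + O\!\left(n^{|S|-|V(S)|-1}\right).$$

We denote $c/\lambda_{\mathrm{GRC}}^{|S|-|V(S)|}$ by $c(\mathbf{H}, S)$ and summarize the results in the following theorem.

Theorem 5.3. Fix $S \subseteq [\![q]\!]^\ell$ and $d$. Choose $\mathbf{H}$ and $p$ so that $\mathcal{C}(\mathbf{H}, \mathbf{0})$ is an $(|S|, d+1)$-AECC and $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathrm{Null}_{>0}\mathbf{B}(D(S))$ is nonempty. Suppose that $\lambda_{\mathrm{GRC}} = \mathrm{lcm}(\{|C| : C \text{ is a cycle in } D(S)\} \cup \{p\})$. Then there exists a constant $c(\mathbf{H}, S)$ such that whenever $\lambda_{\mathrm{GRC}} \mid (n - \ell + 1)$,

$$|\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap p\mathcal{Q}(n; S)| \ge c(\mathbf{H}, S)\,n^{|S|-|V(S)|} + O\!\left(n^{|S|-|V(S)|-1}\right).$$

Theorem 5.3 guarantees that the code size is at least $c(\mathbf{H}, S)\,n^{|S|-|V(S)|}$ for some constant $c(\mathbf{H}, S)$. In other words, when $d$ is constant, we have $C(n, d; S) = \Omega'\!\left(n^{|S|-|V(S)|}\right)$. Since $C(n, d; S) \le |\mathcal{Q}(n; S)| = O\!\left(n^{|S|-|V(S)|}\right)$, we have $C(n, d; S) = \Theta'\!\left(n^{|S|-|V(S)|}\right)$.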

B. Systematic Encoding of Profile Vectors 

In this subsection, we look at efficient one-to-one mappings from $[\![m]\!]^N$ to $p\mathcal{Q}(n; S)$. As with usual constrained coding problems, we are interested in maximizing the number of messages, i.e., the size $m^N$, so that the number of messages is close to $|p\mathcal{Q}(n; S)| = \Theta'\!\left(n^{|S|-|V(S)|}\right)$. We achieve this goal by exhibiting a systematic encoder with $m = \Theta(n)$ and $N = |S| - |V(S)| - 1$. More formally, we prove the following theorem.

Theorem 5.4 (Systematic Encoder). Fix $n$ and $S \subseteq [\![q]\!]^\ell$. Pick any $m$ so that

$$m \le \frac{n - \ell + 1}{\binom{|V(S)|}{2}(q-1) + |S| - |V(S)| - 1}. \tag{11}$$

Suppose further that $D(S)$ is Hamiltonian and contains a loop. Then, there exists a set $I \subseteq S$ of coordinates of size $|S| - |V(S)| - 1$ with the following property: for any $\mathbf{v} \in [\![m]\!]^{|I|}$, there exists an $\ell$-gram profile vector $\mathbf{u} \in p\mathcal{Q}(n; S)$ such that $\mathbf{u}|_I = \mathbf{v}$. Furthermore, $\mathbf{u}$ can be found in time $O(|V(S)|)$.

In other words, given any word $\mathbf{v}$ of length $N = |I| = |S| - |V(S)| - 1$, one can always extend it to obtain a profile vector $\mathbf{u} \in p\mathcal{Q}(n; S)$ of length $|S|$. As pointed out earlier, this theorem provides a simple way of constructing $\ell$-gram codes from AECCs and we sketch the construction in what follows.

Let $\phi_{\mathrm{sys}}(\mathbf{v})$ denote the profile vector resulting from Theorem 5.4 given input $\mathbf{v}$. Consider an $m$-ary $(N, d)$-AECC $\mathcal{C}$ with $N = |S| - |V(S)| - 1$ and $m$ satisfying (11). Let $\phi_{\mathrm{sys}}(\mathcal{C}) = \{\phi_{\mathrm{sys}}(\mathbf{v}) : \mathbf{v} \in \mathcal{C}\}$. Then $\phi_{\mathrm{sys}}(\mathcal{C}) \subseteq p\mathcal{Q}(n; S)$. Furthermore, $\phi_{\mathrm{sys}}(\mathcal{C})$ has asymmetric distance at least $d$, since restricting the code $\phi_{\mathrm{sys}}(\mathcal{C})$ to the coordinates in $I$ yields $\mathcal{C}$. Hence, we have the following corollary.

Corollary 5.5. Fix $n$ and $S \subseteq [\![q]\!]^\ell$ and pick $m$ satisfying (11). Suppose $D(S)$ is Hamiltonian and contains a loop. If $\mathcal{C}$ is an $m$-ary $(|S| - |V(S)| - 1, d)$-AECC, then $\phi_{\mathrm{sys}}(\mathcal{C}) = \{\phi_{\mathrm{sys}}(\mathbf{v}) : \mathbf{v} \in \mathcal{C}\}$ is an $(n, d; S)$-GRC.

For compactness, we write $V$, $\mathbf{A}$ and $\mathbf{B}$ instead of $V(S)$, $\mathbf{A}(S)$ and $\mathbf{B}(D(S))$. To prove Theorem 5.4, consider the restricted de Bruijn digraph $D(S)$. By the assumptions of the theorem, denote the set of $|V|$ arcs in a Hamiltonian cycle as $H$ and the arc corresponding to a loop by $a_0$. We set $I$ to be $S \setminus (H \cup \{a_0\})$.

We reorder the coordinates so that the arcs in $H$ are ordered first, followed by the arc $a_0$ and then the arcs in $I$. So, given $\mathbf{v} = (v_1, v_2, \ldots, v_{|I|}) \in [\![m]\!]^{|I|}$, the proof of Theorem 5.4 essentially reduces to finding integers $x_1, x_2, \ldots, x_{|V|}, y$ such that

$$\mathbf{A}\,(x_1, x_2, \ldots, x_{|V|}, y, v_1, v_2, \ldots, v_{|I|})^T = (n - \ell + 1)\,\mathbf{b}. \tag{12}$$

Considering the first row of $\mathbf{A}$ separately from the remaining rows, we see that (12) is equivalent to the following system of equations:

$$\sum_{i=1}^{|V|} x_i + y + \sum_{i=1}^{|I|} v_i = n - \ell + 1, \tag{13}$$

$$\mathbf{0} = \mathbf{B}\,(x_1, \ldots, x_{|V|}, y, v_1, \ldots, v_{|I|})^T = \mathbf{B}\,(x_1, \ldots, x_{|V|}, 0, 0, \ldots, 0)^T + \mathbf{B}\,(0, \ldots, 0, y, 0, \ldots, 0)^T + \mathbf{B}\,(0, \ldots, 0, 0, v_1, \ldots, v_{|I|})^T. \tag{14}$$

Since the first $|V|$ columns of $\mathbf{B}$ correspond to the arcs in $H$, we have

$$\mathbf{B}\,(x_1, \ldots, x_{|V|}, 0, 0, \ldots, 0)^T = (x_1 - x_2,\ x_2 - x_3,\ \ldots,\ x_{|V|} - x_1)^T.$$

Since the $(|V| + 1)$-th column of $\mathbf{B}$ is a 0-column, we have $\mathbf{B}\,(0, \ldots, 0, y, 0, \ldots, 0)^T = \mathbf{0}$ for any $y$.

For the final summand, let $\mathbf{B}\,(0, \ldots, 0, 0, v_1, \ldots, v_{|I|})^T = (r_1, r_2, \ldots, r_{|V|})^T$. We can then rewrite (14) as

$$x_{i+1} - x_i = r_i, \quad \text{for } 1 \le i \le |V| - 1. \tag{15}$$

Since $\mathbf{1}^T\mathbf{B} = \mathbf{0}^T$, we have $\mathbf{1}^T(r_1, r_2, \ldots, r_{|V|})^T = \sum_{i=1}^{|V|} r_i = 0$, so the remaining equation $x_1 - x_{|V|} = r_{|V|}$ follows from (15). Furthermore, we assume without loss of generality that $\sum_{i=1}^{j} r_i \ge 0$ for all $1 \le j \le |V|$. This can be achieved by cyclically relabelling the nodes.

It suffices to show that an integer solution for (15) and (13) exists, satisfying $y \ge 1$ and $x_i \ge 1$ for $i \in [|V|]$. Consider the following choices of $x_i$ and $y$:

$$x_i = 1 + \sum_{j=1}^{i-1} r_j,$$

$$y = (n - \ell + 1) - \sum_{i=1}^{|I|} v_i - \sum_{i=1}^{|V|} x_i.$$

Clearly, $x_i$ and $y$ satisfy (15) and (13). Since each $v_i$ is an integer, all $r_i$ are integers, so $x_i$ and $y$ are also integers. Furthermore, each $x_i \ge 1$, since we chose the labeling so that $\sum_{i=1}^{j} r_i \ge 0$. We still must show that $y \ge 1$.

First, we observe that $r_i \le (q-1)m$ for all $i$, since each vertex has at most $(q-1)$ incoming arcs in $I$ and by design, each $v_i$ is strictly less than $m$. Thus, each $x_i$ satisfies

$$x_i \le 1 + (i-1)(q-1)m.$$

Summing over all $i$, we have

$$\sum_{i=1}^{|V|} x_i \le \sum_{i=1}^{|V|} (i-1)(q-1)m = (q-1)m\binom{|V|}{2}.$$

Since also each $v_i < m$, we have

$$y > (n - \ell + 1) - m\left(|I| + (q-1)\binom{|V|}{2}\right).$$

By the choice of $m$, it follows that $y > 0$. This completes the proof of Theorem 5.4.

Example 5.1. Let $S = [\![2]\!]^3$ and let $n = 20$. Then Theorem 5.4 states that there is a systematic encoder that maps words from $[\![2]\!]^3$ into $p\mathcal{Q}(20; 2, 3)$. We list all eight encoded profile vectors with their systematic part highlighted in boldface.


14 



13 



11 



10 



For instance, the codeword $000 \in [\![2]\!]^3$ is mapped to the profile vector $(18, 1, 0, 1, 1, 0, 1, 0)$. Via the Euler map described in Section 7, this profile vector is mapped to $00\cdots01100 \in \mathcal{Q}(24; 2, 3)$.

Observe that we can systematically encode $[\![2]\!]^3$ into $p\mathcal{Q}(n; 2, 3)$ even when $n$ is smaller than 24. In fact, in this example, we can systematically encode $[\![2]\!]^3$ into $p\mathcal{Q}(10; 2, 3)$. In general, we can systematically encode $[\![m]\!]^3$ into $p\mathcal{Q}(4m + 2; 2, 3)$. In this case, the size of the message set is $m^3 \approx n^3/64$, while the number of all possible closed profile vectors is approximately $n^4/288$.
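
The encoding of $[\![m]\!]^3$ into $p\mathcal{Q}(4m+2; 2, 3)$ can be sketched directly for this binary example. The code below is not the general relabelling procedure from the proof of Theorem 5.4; it is a hand-derived flow-balancing rule that assumes the Hamiltonian cycle $00 \to 01 \to 11 \to 10 \to 00$, the loop $a_0 = 000$, the systematic set $I = \{010, 101, 111\}$ (read off from the zero entries of the profile quoted above), and lexicographic ordering of the coordinates. The function names are hypothetical.

```python
from itertools import product

GRAMS = ['000', '001', '010', '011', '100', '101', '110', '111']


def encode(v, n):
    """Extend v = (u_010, u_101, u_111) to a balanced 3-gram profile summing to n - 2."""
    v010, v101, v111 = v
    a = 1 + max(0, v010 - v101)   # count of 001, which must equal the count of 100
    b = a + v101 - v010           # count of 011, which must equal the count of 110
    y = (n - 2) - (v010 + v101 + v111) - 2 * (a + b)  # count of the loop 000
    if y < 0:
        raise ValueError("n is too small for this message")
    return (y, a, v010, b, a, v101, b, v111)


def is_balanced(u):
    """Check that in-degree equals out-degree at each 2-gram node."""
    for node in ('00', '01', '10', '11'):
        inflow = sum(c for g, c in zip(GRAMS, u) if g[1:] == node)
        outflow = sum(c for g, c in zip(GRAMS, u) if g[:-1] == node)
        if inflow != outflow:
            return False
    return True


# Every message in [[m]]^3 fits into a profile of a word of length n = 4m + 2.
for m in range(1, 6):
    for v in product(range(m), repeat=3):
        u = encode(v, 4 * m + 2)
        assert is_balanced(u) and sum(u) == 4 * m
```

With $n = 24$ the all-zero message is sent to $(18, 1, 0, 1, 1, 0, 1, 0)$, matching the profile vector quoted in the example.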

In Section 6 and Example 6.1, we observe that the construction given in Section 5-A yields a larger code size. Nevertheless, the systematic encoder is conceptually simple and furthermore, the systematic property of the construction in Section 5-B can be exploited to integrate rank modulation codes into our coding schemes for DNA storage, useful for automatic decoding via hybridization. We describe this procedure in detail in Section 7-B.


6. Numerical Computations for $S = S(q, \ell; q^*, [w_1, w_2])$

In what follows, we summarize numerical results for code sizes pertaining to the special case when $S = S(q, \ell; q^*, [w_1, w_2])$.

By Proposition 3.1, $D(q, \ell; q^*, [w_1, w_2])$ is Eulerian and therefore strongly connected. In other words, Theorem 3.3 applies and we have $|\mathcal{Q}(n; S)| = \Theta'\!\left(n^{|S|-|V(S)|}\right)$, where $|S|$ is given by $|S(q, \ell; q^*, [w_1, w_2])| = \sum_{w=w_1}^{w_2} \binom{\ell}{w}(q^*)^w(q - q^*)^{\ell-w}$, while $|V(S)|$ is given by $|S(q, \ell-1; q^*, [w_1 - 1, w_2])| = \sum_{w=w_1-1}^{w_2} \binom{\ell-1}{w}(q^*)^w(q - q^*)^{\ell-1-w}$.

Let $D = |S| - |V(S)|$. We determine next the coefficient of $n^D$ in $|\mathcal{Q}(n; S)|$. When $w_2 = \ell$, the digraph $D(q, \ell; q^*, [w_1, \ell])$ contains the loop that corresponds to the $\ell$-gram $1^\ell$. Hence, by Corollary 4.6, the desired coefficient is constant and we denote it by $c(q, \ell; q^*, [w_1, \ell])$. When $S = [\![q]\!]^\ell$, we denote this coefficient by $c(q, \ell)$ and remark that this value corresponds to the constant defined in Theorem 3.5.

When $w_2 < \ell$, the digraph $D(q, \ell; q^*, [w_1, w_2])$ does not contain any loops. Recall from Section 4 the definitions of $\mathcal{P}(S)$, $\lambda_S$ and $L_{\mathcal{P}(S)}(n - \ell + 1)$. In particular, recall that the lattice point enumerator $L_{\mathcal{P}(S)}(n - \ell + 1)$ is a quasipolynomial of degree $D$ whose period divides $\lambda_S$ and that consequently, the coefficient of $n^D$ in $|\mathcal{Q}(n; S)|$ is periodic. For ease of presentation, we only determine the coefficient of $n^D$ for those values of $n$ for which $\lambda_S$ divides $(n - \ell + 1)$, or $n - \ell + 1 = \lambda_S t$ for some integer $t$. In this instance, the desired coefficient is given by $c(q, \ell; q^*, [w_1, w_2]) = c/\lambda_S^D$, where $c$ is the leading coefficient of the polynomial $L_{\lambda_S\mathcal{P}(S)}(t)$.
In summary, we have the following corollary.








































TABLE I

Computation of $c(q, \ell)$

q    ℓ    D     c(q,ℓ)
2    2    2     1/4*
3    2    6     1/8640*
4    2    12    1/45984153600*
5    2    20    37/84081093402584678400000*
2    3    4     1/288*
3    3    18    887/358450977137334681600000
2    4    8     283/9754214400
2    5    16    722299813/94556837526637331349504000000

Entries marked by an asterisk refer to values that were also derived by Jacquet et al. [9].


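Corollary 6.1. Consider $S = S(q, \ell; q^*, [w_1, w_2])$ and define

$$D = \sum_{w=w_1}^{w_2} \binom{\ell}{w}(q^*)^w(q - q^*)^{\ell-w} - \sum_{w=w_1-1}^{w_2} \binom{\ell-1}{w}(q^*)^w(q - q^*)^{\ell-1-w}.$$

Suppose that $\lambda_S = \mathrm{lcm}\{|C| : C \text{ is a cycle in } D(S)\}$. Then for some constant $c(q, \ell; q^*, [w_1, w_2])$,

(i) If $w_2 = \ell$, $|\mathcal{Q}(n; S)| = c(q, \ell; q^*, [w_1, \ell])\,n^D + O(n^{D-1})$ for all $n$;

(ii) Otherwise, if $w_2 < \ell$, $|\mathcal{Q}(n; S)| = c(q, \ell; q^*, [w_1, w_2])\,n^D + O(n^{D-1})$ for all $n$ such that $\lambda_S \mid (n - \ell + 1)$.
When $S = [\![q]\!]^\ell$, we write $c(q, \ell)$ instead of $c(q, \ell; 1, [0, \ell])$.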
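
The exponent $D$ in Corollary 6.1 is straightforward to evaluate from the two binomial sums. The following sketch (function names are hypothetical) clips the summation range to $0 \le w \le \ell$, so that out-of-range binomial coefficients contribute zero, and reproduces several of the $D$ values listed in Tables I and II.

```python
from math import comb


def weight_class_size(q, ell, q_star, w_lo, w_hi):
    """Number of q-ary words of length ell whose weight lies in [w_lo, w_hi],
    where 'weight' counts symbols from a distinguished set of q_star symbols."""
    return sum(
        comb(ell, w) * q_star ** w * (q - q_star) ** (ell - w)
        for w in range(max(w_lo, 0), min(w_hi, ell) + 1)
    )


def dimension(q, ell, q_star, w1, w2):
    """D = |S| - |V(S)| for S = S(q, ell; q_star, [w1, w2])."""
    size_S = weight_class_size(q, ell, q_star, w1, w2)
    size_V = weight_class_size(q, ell - 1, q_star, w1 - 1, w2)
    return size_S - size_V


assert dimension(2, 4, 1, 2, 3) == 3    # Table II, first row
assert dimension(2, 6, 1, 3, 5) == 15   # Table II, row with lambda_S = 5354228880
assert dimension(2, 3, 1, 0, 3) == 4    # Table I, S = [[2]]^3
```

Only the exponent is obtained this way; the leading coefficients $c(q, \ell; q^*, [w_1, w_2])$ require lattice point enumeration.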


We determine $c(q, \ell; q^*, [w_1, w_2])$ via numerical computations. Computing the lattice point enumerator is a fundamental problem in discrete optimization and many algorithms and software implementations have been developed for such purposes. We make use of the software LattE, developed by Baldoni et al. [22], which is based on an algorithm of Barvinok [23]. Barvinok's algorithm essentially triangulates the supporting cones of the vertices of a polytope to obtain simplicial cones and then decomposes the simplicial cones recursively into unimodular cones. As the rational generating functions of the resulting unimodular cones can be written down easily, adding and subtracting them according to the inclusion-exclusion principle and Brion's theorem gives the desired rational generating function of the polytope. The algorithm is shown to enumerate the number of lattice points in polynomial time when the dimension of the polytope is fixed.

Using LattE, we computed the desired coefficients for various values of $(q, \ell; q^*, [w_1, w_2])$. As an illustrative example, LattE determined $c(2, 4) = 283/9754214400$ with computational time less than a minute. This shows that although the exact evaluation of $c(q, \ell)$ is prohibitively complex (as pointed out by Jacquet et al. [9]), numerical computations of $c(q, \ell)$ and $c(q, \ell; q^*, [w_1, w_2])$ are feasible for certain moderate values of parameters. We tabulate these values in Tables I and II.


A. Lower Bounds on Code Sizes

Next, we provide numerical results for lower bounds on the code sizes derived in Section 5-A.
When $S = S(q, \ell; q^*, [w_1, w_2])$, the digraph $D(S)$ is Eulerian by Proposition 3.1 and hence, $\mathbf{1}$ belongs to $\mathrm{Null}_{>0}\mathbf{B}(D(S))$. Therefore, if $\mathcal{C}(\mathbf{H}, \mathbf{0})$ contains the vector $\mathbf{1}$ as well, $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathrm{Null}_{>0}\mathbf{B}(D(S))$ is nonempty and the condition of Theorem 5.3 is satisfied. Hence, we have the following corollary.







TABLE II

Computation of $c(q, \ell; q^*, [w_1, w_2])$. We fixed $q = 2$ and $q^* = 1$.

ℓ    w_1   w_2   D     λ_S           c(2,ℓ;1,[w_1,w_2])
4    2     3     3     60            1/360
4    2     4     4     -             1/1440
5    2     3     6     120           1/5184000
5    2     4     10    27720         40337/34566497280000000
5    2     5     11    -             3667/34566497280000000
5    3     4     4     420           23/302400
5    3     5     5     -             23/1512000
6    3     4     10    65520         43919/754932300595200000
6    3     5     15    5354228880    1106713336565579/739506679855711968646397952000000000
6    4     5     5     840           1/518400

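Corollary 6.2. Let $S = S(q, \ell; q^*, [w_1, w_2])$. Fix $d$ and choose $\mathbf{H}$ and $p$ such that $\mathcal{C}(\mathbf{H}, \mathbf{0})$ is an $(|S|, d+1)$-AECC containing $\mathbf{1}$. Suppose that $\lambda_{\mathrm{GRC}} = \mathrm{lcm}(\{|C| : C \text{ is a cycle in } D(S)\} \cup \{p\})$. Then there exists a constant $c(\mathbf{H}, S)$ such that whenever $\lambda_{\mathrm{GRC}} \mid (n - \ell + 1)$,

$$|\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap p\mathcal{Q}(n; S)| \ge c(\mathbf{H}, S)\,n^D + O(n^{D-1}),$$

where $D = |S| - |V(S)| = \sum_{w=w_1}^{w_2} \binom{\ell}{w}(q^*)^w(q - q^*)^{\ell-w} - \sum_{w=w_1-1}^{w_2} \binom{\ell-1}{w}(q^*)^w(q - q^*)^{\ell-1-w}$.

Example 6.1. Let $S = [\![2]\!]^3$ and $d = 2$. Choose $p = 13$ and

$$\mathbf{H} = \begin{pmatrix} 1 & 2 & 3 & 5 & 8 & 10 & 11 & 12 \\ 1 & 4 & 9 & 12 & 12 & 9 & 4 & 1 \end{pmatrix}.$$

Then $\mathcal{C}(\mathbf{H}, \mathbf{0})$ is an $(8, 3)$-AECC containing $\mathbf{1}$. Since $D(S)$ has four nodes, its cycles have lengths 1, 2, 3 and 4, and we have $\lambda_{\mathrm{GRC}} = \mathrm{lcm}(\{1, 2, 3, 4\} \cup \{13\}) = 156$. Using LattE, we compute the lattice point enumerator of $\lambda_{\mathrm{GRC}}\mathcal{P}^{\circ}_{\mathrm{GRC}}(\mathbf{H}, S)$ to be $12168t^4 - 1248t^3 + 131t^2 - 16t + 1$. Hence, for $n = 156t + 2$, the number of codewords in $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathcal{E}(n; 2, 3)$ is given by $12168t^4 - 1248t^3 + 131t^2 - 16t + 1$. When $t = 1$ or $n = 158$, there exists a $(158, 3; 2, 3)$-GRC of size at least 11036.

We compare this result with the one provided by the construction using the systematic encoder described in Section 5-B and in particular, Example 5.1. When $n = 158$, we can systematically encode words in $[\![39]\!]^3$ into $p\mathcal{Q}(158; 2, 3)$. Taking

$$\mathbf{H} = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 4 & 4 \end{pmatrix},$$

we obtain a 39-ary $(3, 3)$-AECC of size 2368. Applying the systematic encoder in Theorem 5.4, we construct a $(158, 3; 2, 3)$-GRC of size 2368.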
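
The arithmetic in Example 6.1 can be re-derived in a few lines. The sketch below takes the enumerator polynomial as given (it was computed with LattE and is not recomputed here) and checks the second row of $\mathbf{H}$, the membership of $\mathbf{1}$ in $\mathcal{C}(\mathbf{H}, \mathbf{0})$, the value of $\lambda_{\mathrm{GRC}}$, the code size at $t = 1$, and the normalized leading coefficient. Variable names are illustrative.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

p, d = 13, 2
alphas = (1, 2, 3, 5, 8, 10, 11, 12)
H = [[pow(a, k, p) for a in alphas] for k in range(1, d + 1)]

# The all-one vector lies in C(H, 0) iff every row of H sums to 0 mod p.
contains_all_ones = all(sum(row) % p == 0 for row in H)


def lcm_of(values):
    return reduce(lambda a, b: a * b // gcd(a, b), values, 1)


# Cycle lengths of the binary de Bruijn digraph on 2-grams, together with p.
lam = lcm_of([1, 2, 3, 4, p])


def enumerator(t):
    """Lattice point enumerator reported in Example 6.1 (taken as given)."""
    return 12168 * t**4 - 1248 * t**3 + 131 * t**2 - 16 * t + 1


# c(H, S) = (leading coefficient) / lambda_GRC^D with D = 4.
c_HS = Fraction(12168, lam**4)
```

The normalized leading coefficient evaluates to $1/48672 = (1/288)/13^2$, which is the entry for $d = 2$, $p = 13$ in Table III.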


Using LattE, we determined $c(\mathbf{H}, S)$ for moderate parameter values and summarize the results in Table III. We conclude this section with a conjecture on the relation between $c(q, \ell)$ and $c(\mathbf{H}, S)$.


Conjecture 6.3. Fix $q, \ell, d$. Choose $\mathbf{H}$ and $p$ such that $\mathcal{C}(\mathbf{H}, \mathbf{0})$ is an $(N, d+1)$-AECC containing $\mathbf{1}$. Let $c(q, \ell)$ and $c(\mathbf{H}, S)$ be the constants defined in Corollaries 6.1 and 6.2, respectively. Then $c(\mathbf{H}, S) \ge c(q, \ell)/p^d$.

Roughly speaking, the conjecture states that asymptotically, $|\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap \mathcal{E}(n; q, \ell)|$ is at least $|\mathcal{Q}(n; q, \ell)|/p^d$. In other words, for our particular choice of $\mathbf{H}$ and $\boldsymbol{\beta}$, we asymptotically achieve the code size guaranteed by the pigeonhole principle.







TABLE III

Computations of $c(\mathbf{H}, S)$

When $S = [\![2]\!]^3$, we have $c(2, 3) = 1/288$.

d    p     D    λ_GRC    c(H,S)          c(2,3)/p^d
1    11    4    132      1/3168          1/3168
2    13    4    156      1/48672         1/48672
3    13    4    156      1/632736        1/632736
4    17    4    204      1/24054048      1/24054048
5    17    4    204      1/24054048      1/408918816
6    17    4    204      1/24054048      1/6951619872

When $S = [\![2]\!]^4$, we have $c(2, 4) = 283/9754214400$.

d    p     D    λ_GRC    c(H,S)                 c(2,4)/p^d
1    17    8    14280    283/165821644800       283/165821644800
2    17    8    14280    283/2818967961600      283/2818967961600
3    17    8    14280    283/47922455347200     283/47922455347200

When $S = S(2, 5; 1, [2, 3])$, we have $c(2, 5; 1, [2, 3]) = 1/5184000$.

d    p     D    λ_GRC    c(H,S)              c(2,5;1,[2,3])/p^d
1    23    6    2760     1/119232000         1/119232000
2    29    6    3480     1/4359744000        1/4359744000
3    29    6    3480     1/126432576000      1/126432576000
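
The tabulated values can be checked against Conjecture 6.3 with exact rational arithmetic. The sketch below only re-examines the numbers in Table III (it does not recompute any lattice point enumerator); the data layout and names are illustrative.

```python
from fractions import Fraction as F

# (base constant c, list of (d, p, tabulated c(H, S))) for each sub-table of Table III
TABLE_III = [
    (F(1, 288), [
        (1, 11, F(1, 3168)), (2, 13, F(1, 48672)), (3, 13, F(1, 632736)),
        (4, 17, F(1, 24054048)), (5, 17, F(1, 24054048)), (6, 17, F(1, 24054048)),
    ]),
    (F(283, 9754214400), [
        (1, 17, F(283, 165821644800)), (2, 17, F(283, 2818967961600)),
        (3, 17, F(283, 47922455347200)),
    ]),
    (F(1, 5184000), [
        (1, 23, F(1, 119232000)), (2, 29, F(1, 4359744000)),
        (3, 29, F(1, 126432576000)),
    ]),
]

rows = [(c, d, p, c_hs) for c, entries in TABLE_III for d, p, c_hs in entries]

# Conjecture 6.3: c(H, S) >= c / p^d on every tabulated row.
conjecture_holds = all(c_hs >= c / p**d for c, d, p, c_hs in rows)

# Rows where the bound is strict rather than met with equality.
strict_rows = [(d, p) for c, d, p, c_hs in rows if c_hs > c / p**d]
```

On these data the bound holds everywhere and is met with equality except for $S = [\![2]\!]^3$ with $p = 17$ and $d \in \{5, 6\}$, where the tabulated $c(\mathbf{H}, S)$ stays at $1/24054048$.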


7. Decoding of Profile Vectors 

Recall the DNA storage channel illustrated by Fig. 1. The channel takes as its input a word $\mathbf{x} \in \mathcal{Q}(n; S)$ and outputs a vector $\hat{\mathbf{x}} \in \mathbb{Z}^{|S|}$. Assuming no errors, the vector $\hat{\mathbf{x}}$ corresponds to the profile vector $\mathbf{p}(\mathbf{x}; S) \in p\mathcal{Q}(n; S)$. In this channel model and the code constructions in Section 5, we have implicitly assumed the existence of an efficient algorithm that decodes $\mathbf{p}(\mathbf{x}; S)$ back to the message $\mathbf{x}$. We explicitly describe this algorithm in what follows.

Let $\mathbf{u}$ be a profile vector in $p\mathcal{Q}(n; S)$ so that $\mathbf{u} = \mathbf{p}(\mathbf{x}; S)$ for some $\mathbf{x} \in \mathcal{Q}(n; S)$. As with the proof of Lemma 3.2, we construct a multigraph on the node set $V(S)$ by adding $u_z$ arcs for each $z \in S$. We remove any isolated vertices and we have a connected Eulerian multidigraph. We subsequently apply any linear-time algorithm like Hierholzer's algorithm [24] to this multidigraph to obtain an Eulerian walk$^2$ and let $\mathrm{Euler}(\mathbf{u})$ denote the word in $[\![q]\!]^n$ obtained from this Eulerian walk. It remains to verify that $\mathrm{Euler}(\mathbf{u}) = \mathbf{x}$.

As mentioned in Section 2, an element in $\mathcal{Q}(n; S)$ is an equivalence class $X \subseteq [\![q]\!]^n$, where $\mathbf{x}, \mathbf{x}' \in X$ implies that $\mathbf{p}(\mathbf{x}; S) = \mathbf{p}(\mathbf{x}'; S)$. Here, we fix the choice of representative for $X$. As hinted by the previous discussion, we let this representative be $\mathrm{Euler}(\mathbf{p}(\mathbf{y}; S))$ for some $\mathbf{y} \in X$ and observe that this definition is independent of the choice of $\mathbf{y}$. Then with this choice of representatives, the function Euler indeed decodes a profile vector back to its representative codeword.

In summary, we identify the elements in $\mathcal{Q}(n; S)$ with the set of representatives $\{\mathrm{Euler}(\mathbf{u}) : \mathbf{u} \in p\mathcal{Q}(n; S)\}$. Then for any $\mathbf{x} \in \mathcal{Q}(n; S)$, the function Euler decodes $\mathbf{p}(\mathbf{x}; S)$ to $\mathbf{x}$ in linear time.
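
A minimal sketch of such an Euler map, assuming $S = [\![q]\!]^\ell$, a balanced (closed-word) profile listed in lexicographic order, and the tie-breaking rule "always follow the smallest available $\ell$-gram", is given below. It uses the standard iterative form of Hierholzer's algorithm; the function names and the choice of the smallest nonisolated node as the starting point are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def all_grams(q, ell):
    return [''.join(t) for t in product([str(a) for a in range(q)], repeat=ell)]


def profile(word, q=2, ell=3):
    """Profile vector of `word`: counts of each ell-gram in lexicographic order."""
    counts = defaultdict(int)
    for i in range(len(word) - ell + 1):
        counts[word[i:i + ell]] += 1
    return tuple(counts[g] for g in all_grams(q, ell))


def euler_word(u, q=2, ell=3):
    """Decode a balanced profile vector into a word via an Eulerian circuit."""
    out_arcs = defaultdict(list)
    for g, c in zip(all_grams(q, ell), u):
        out_arcs[g[:-1]].extend([g] * c)      # u_z parallel arcs for the gram z
    for arcs in out_arcs.values():
        arcs.sort(reverse=True)               # pop() then returns the smallest gram
    start = min(v for v, arcs in out_arcs.items() if arcs)
    stack, circuit = [start], []
    while stack:
        v = stack[-1]
        if out_arcs[v]:
            stack.append(out_arcs[v].pop()[1:])  # follow an unused arc to its suffix node
        else:
            circuit.append(stack.pop())          # dead end: node is final in the circuit
    circuit.reverse()
    # Spell the word: first node in full, then the last symbol of each later node.
    return circuit[0] + ''.join(v[-1] for v in circuit[1:])


u = (18, 1, 0, 1, 1, 0, 1, 0)
word = euler_word(u)
assert profile(word) == u
```

For the profile vector $(18, 1, 0, 1, 1, 0, 1, 0)$ from Example 5.1 this returns twenty zeros followed by 1100, a word of length 24 whose profile is again the input.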

An interesting feature of our coding scheme is that we avoided the assembly problem by designing our codewords to have distinct profile vectors and profiles at sufficiently large distance. However, there are challenges in counting accurately the number of $\ell$-grams and determining the profile vector of an arbitrary word using current sequencing technologies. We examine next a number of practical methods for profile counting and decoding and address emerging issues via known coding solutions. In our discussion, we assume that $S = [\![q]\!]^\ell$.

2 Most descriptions of Hierholzer's algorithm involve an arbitrary choice for the starting vertex and the subsequent vertices to visit. Hence, it is possible for the algorithm to yield different walks based on the same multigraph. Nevertheless, we may fix an order for the vertices and have the algorithm always choose the 'smallest' available vertex. Under these assumptions, $\mathrm{Euler}(\mathbf{u})$ is always well defined.

Fig. 4. Sequencing by hybridization. Instead of obtaining the exact count of the $\ell$-grams, we obtain auxiliary information on the count: (a) we obtain the set of 3-grams present in 00111011000000; (b) we obtain the relative order of the counts of 010, 101 and 111.

In particular, we look at an older technology, sequencing by hybridization (SBH), proposed in [25], as a means of automated decoding. The idea behind SBH is to build an array of $\ell$-grams or probes; this array of probes is commonly referred to as a sequencing chip. A sample of single-stranded DNA to be sequenced is fragmented, labelled with a radioactive or fluorescent material, and then presented to the chip. Each probe in the array hybridizes with its reverse complement, provided the corresponding $\ell$-gram is present in the sample. Then an optical detector measures the intensity of hybridization of the labelled DNA and hence infers the number of $\ell$-grams present in the sample.

A. Detecting presence or absence of i-grams 

In the initial studies of SBH, hybridization results only indicated the presence or absence of certain /-grams. 
In our terminology, if x is the codeword, the channel outputs a subset of |[q] given by supp(p(x; q, Zj), where 
supp(u) denotes the set of coordinates z with u z > 1 (see Fig. (4} a)). Then, we can define <Z* ram (x. y: q. Zj = 
|supp(p(x; q,Z))Asupp(p(y; q, Z))| for any pair of x,y G [g] n . 

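As before, $([\![q]\!]^n, d^*_{\rm gram})$ forms a pseudometric space, and we convert this space into a metric space via an equivalence relation: we say $\mathbf{x} \sim \mathbf{y}$ if and only if $d^*_{\rm gram}(\mathbf{x}, \mathbf{y}; q, \ell) = 0$. Then, by defining $\mathcal{Q}^*(n; q, \ell) \triangleq [\![q]\!]^n / \sim$, we obtain a metric space.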

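Let $\mathcal{C} \subseteq \mathcal{Q}^*(n; q, \ell)$. If $d = \min\{d^*_{\rm gram}(\mathbf{x}, \mathbf{y}; q, \ell) : \mathbf{x}, \mathbf{y} \in \mathcal{C}, \mathbf{x} \ne \mathbf{y}\}$, then $\mathcal{C}$ is said to be an $(n, d; q, \ell)$-$\ell^*$-gram reconstruction code ($*$-GRC).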

We have the following proposition, which is an analogue of Proposition 2.2.

Proposition 7.1. Given an $(n, d; q, \ell)$-$*$-GRC, a set of $n - \ell + 1 - \lfloor (d-1)/2 \rfloor$ $\ell$-grams suffices to identify a codeword.

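Proof: Let $t = n - \ell + 1 - \lfloor (d-1)/2 \rfloor$. Suppose otherwise that there exists a pair of distinct codewords $\mathbf{x}$ and $\mathbf{y}$ that contain a common set of $t$ $\ell$-grams. Since each codeword contains at most $n - \ell + 1$ distinct $\ell$-grams,

\[ d^*_{\rm gram}(\mathbf{x}, \mathbf{y}; q, \ell) = |{\rm supp}(\mathbf{p}(\mathbf{x}; q, \ell)) \,\Delta\, {\rm supp}(\mathbf{p}(\mathbf{y}; q, \ell))| \le (n - \ell + 1 - t) + (n - \ell + 1 - t) = 2\lfloor (d-1)/2 \rfloor \le d - 1 < d, \]

resulting in a contradiction. ■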
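As a small worked instance of the bound (the parameters $n = 6$, $\ell = 3$, $d = 8$ are chosen here purely for illustration and are not taken from the text):

```latex
t = n - \ell + 1 - \left\lfloor \frac{d-1}{2} \right\rfloor
  = 6 - 3 + 1 - \left\lfloor \frac{7}{2} \right\rfloor
  = 4 - 3 = 1 .
```

With these parameters a single 3-gram already identifies the codeword. This agrees with the observation that $d = 2(n - \ell + 1)$ forces distinct codewords to have disjoint sets of $\ell$-grams.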

Determining the maximum size of an $(n, d; q, \ell)$-$*$-GRC turns out to be related to certain well-studied combinatorial problems.

Case $d = 1$. The maximum size of an $(n, 1; q, \ell)$-$*$-GRC is given by $|\mathcal{Q}^*(n; q, \ell)|$. Equivalently, this count corresponds to the number of possible sets of $\ell$-grams that can be obtained from words of length $n$. Observe that $|\mathcal{Q}^*(n; q, \ell)| \le 2^{q^\ell}$ and hence $|\mathcal{Q}^*(n; q, \ell)|$ cannot be a quasipolynomial in $n$ with degree at least one. Therefore, it appears that Ehrhart theory is not applicable in this context. Nevertheless, preliminary investigations of this quantity for $q = 2$ have been performed by Tan and Shallit [26]. In particular, Tan and Shallit proved the following proposition for $n < 2\ell$.


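Proposition 7.2 ([26, Corollary 19]). For $\ell \le n < 2\ell$, we have

\[ |\mathcal{Q}^*(n; q, \ell)| = q^n - \sum_{k=1}^{n-\ell+1} \frac{k-1}{k} \sum_{d \mid k} \mu\!\left(\frac{k}{d}\right) q^d, \]

where $\mu(\cdot)$ is the Möbius function defined as

\[ \mu(n) = \begin{cases} 1, & \text{if } n \text{ is a square-free positive integer with an even number of prime factors;} \\ -1, & \text{if } n \text{ is a square-free positive integer with an odd number of prime factors;} \\ 0, & \text{otherwise.} \end{cases} \]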
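For small parameters, the number of distinct $\ell$-gram sets can be checked by exhaustive enumeration. The sketch below compares a brute-force count with the expression $q^n - \sum_{k=1}^{n-\ell+1} \frac{k-1}{k} \sum_{d \mid k} \mu(k/d) q^d$; this form of the closed expression is an assumption of the sketch, and the function names are hypothetical.

```python
from itertools import product


def mobius(n):
    """Moebius function computed by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # n has a squared prime factor
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result


def count_gram_sets(n, q, ell):
    """Brute force: number of distinct ell-gram sets over all q-ary words of length n."""
    symbols = "0123456789"[:q]
    seen = set()
    for letters in product(symbols, repeat=n):
        word = "".join(letters)
        seen.add(frozenset(word[i:i + ell] for i in range(n - ell + 1)))
    return len(seen)


def closed_form(n, q, ell):
    """Assumed closed expression, intended for ell <= n < 2 * ell."""
    total = q ** n
    for k in range(1, n - ell + 2):
        # Number of primitive (aperiodic) words of length k; always divisible by k.
        primitive = sum(mobius(k // d) * q ** d for d in range(1, k + 1) if k % d == 0)
        total -= (k - 1) * primitive // k
    return total


for n, ell in [(4, 3), (5, 3), (5, 4)]:
    print(n, ell, count_gram_sets(n, 2, ell), closed_form(n, 2, ell))
```

The subtracted term has a simple reading: the $k$ rotations of a primitive word of length $k$ all produce the same set of $\ell$-grams, so each such class is overcounted $k - 1$ times.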


Case $d = 2(n - \ell + 1)$. For the other extreme, we see that the problem is related to edge-disjoint path packings and decompositions of graphs (see [27], [28]). Formally, consider a graph $G$. A set $\mathcal{C}$ of paths in $G$ is said to be an edge-disjoint path packing of $G$ if each edge in $G$ appears in at most one path in $\mathcal{C}$. An edge-disjoint path packing $\mathcal{C}$ of $G$ is an edge-disjoint path decomposition of $G$ if each edge in $G$ appears in exactly one path in $\mathcal{C}$. Edge-disjoint cycle packings and decompositions are defined similarly.

Now, an $(n, 2(n - \ell + 1); q, \ell)$-$*$-GRC is equivalent to an edge-disjoint path packing of $D(q, \ell)$, where each path is of length $n - \ell + 1$. Furthermore, an edge-disjoint path decomposition of $D(q, \ell)$ into paths of length $n - \ell + 1$ yields an optimal $(n, 2(n - \ell + 1); q, \ell)$-$*$-GRC of size $q^\ell/(n - \ell + 1)$.

Since an edge-disjoint cycle decomposition is also an edge-disjoint path decomposition, we next examine edge-disjoint cycle decompositions of de Bruijn graphs. These combinatorial objects were studied by Cooper and Graham, who proved the following theorem.


Theorem 7.3 ([29, Proposition 2.3, Corollary 2.5]).

(i) There exists an edge-disjoint cycle decomposition of $D(q, \ell)$ into $q$ cycles of length $q^{\ell-1}$, for any $q$ and $\ell$.

(ii) There exists an edge-disjoint cycle decomposition of $D(r2^{k+1}, 3)$ into $8^k$ cycles of length $8r^3$, for any $k \ge 0$ and $r \ge 1$.


Therefore, Theorem 7.3 demonstrates the existence of an optimal $(q^{\ell-1} + \ell - 1, 2q^{\ell-1}; q, \ell)$-$*$-GRC of size $q$ and an optimal $(8r^3 + 2, 16r^3; r2^{k+1}, 3)$-$*$-GRC of size $8^k$ for any $k \ge 0$ and $r \ge 1$.
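To make the link between cycle decompositions and codes concrete, the following sketch turns a hand-picked decomposition of $D(2, 3)$ into two cycles of length 4 into a code with $n = 6$ and $d = 8$. The specific cycles and the helper names are illustrative assumptions, not taken from [29].

```python
def cycle_to_word(arcs):
    """Spell out the word traced by a sequence of consecutive, overlapping ell-grams."""
    word = arcs[0]
    for prev, nxt in zip(arcs, arcs[1:]):
        if prev[1:] != nxt[:-1]:
            raise ValueError("consecutive arcs must overlap in ell - 1 symbols")
        word += nxt[-1]
    return word


def gram_support(word, ell):
    """Set of ell-grams occurring in word."""
    return {word[i:i + ell] for i in range(len(word) - ell + 1)}


# Two arc-disjoint cycles that together cover all 8 arcs (3-grams) of D(2, 3).
cycles = [
    ["000", "001", "010", "100"],
    ["111", "110", "101", "011"],
]
code = [cycle_to_word(c) for c in cycles]
print(code)  # ['000100', '111011']

distance = len(gram_support(code[0], 3) ^ gram_support(code[1], 3))
print(distance)  # 8
```

Here the codeword length is $q^{\ell-1} + \ell - 1 = 6$ and the distance is $2q^{\ell-1} = 8$, in line with the parameters stated above for part (i).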
Fig. 5. Encoding messages for a DNA storage channel that outputs the relative order on the counts of particular $\ell$-grams.


B. Detecting the relative order of $\ell$-grams

As mentioned earlier, it is difficult to infer accurately the number of $\ell$-grams present from the hybridization results. However, we may determine with significantly higher accuracy whether the count of a certain $\ell$-gram is greater than the count of another. In other words, we may view the sequencing channel outputs as rankings or orderings on the $q^\ell$ $\ell$-gram counts, that is, as a permutation of length $q^\ell$ reflecting the $\ell$-gram counts.

This suggests that we consider codewords whose profile vectors carry information about order. More precisely, let ${\rm Perm}(N)$ denote the set of permutations over the set $[\![N]\!]$. We consider codewords whose profile vectors belong to ${\rm Perm}(N)$ and consider a metric on ${\rm Perm}(N)$ that relates to errors resulting from changes in order. The Kendall metric was first proposed by Jiang et al. [30] in rank modulation schemes for nonvolatile flash memories, and codes in this metric have been studied extensively since (see [31] and the references therein). The Ulam metric was later proposed by Farnoud et al. for permutations [32] and multipermutations [33].
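For reference, the Kendall distance between two permutations is the minimum number of adjacent transpositions needed to turn one into the other, which equals the number of pairs that the two permutations order differently. A minimal sketch, assuming permutations are given as tuples over $\{0, \ldots, N-1\}$ (the function name is hypothetical):

```python
from itertools import combinations


def kendall_distance(sigma, pi):
    """Number of element pairs ordered differently by sigma and pi."""
    position_in_pi = {value: index for index, value in enumerate(pi)}
    relabelled = [position_in_pi[value] for value in sigma]
    # Count inversions of sigma when written in the coordinate system of pi.
    return sum(
        1
        for i, j in combinations(range(len(relabelled)), 2)
        if relabelled[i] > relabelled[j]
    )


print(kendall_distance((0, 1, 2), (1, 0, 2)))  # 1
print(kendall_distance((0, 1, 2), (2, 1, 0)))  # 3
```

A single swap of two adjacent ranks, the typical error when two $\ell$-gram counts are close, therefore contributes distance one.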

Unfortunately, due to the flow conservation equations (1), the profile vector of a $q$-ary word is unlikely to have distinct entries and hence be a permutation. Nevertheless, we appeal to the systematic encoder provided by Theorem 5.4. We set $m = q^\ell - q^{\ell-1} - 1$. Then, provided $n$ is sufficiently large, there exists a set $I$ of $m$ coordinates that allows us to extend any word $\mathbf{v}$ in $[\![m]\!]^m$ to a profile vector $\phi_{\rm sys}(\mathbf{v}) \in \mathbf{p}\mathcal{Q}(n; q, \ell)$. In particular, since ${\rm Perm}(m) \subseteq [\![m]\!]^m$, any permutation $\mathbf{v}$ of length $m$ may be extended to a profile vector $\phi_{\rm sys}(\mathbf{v}) \in \mathbf{p}\mathcal{Q}(n; q, \ell)$.

This implies that for the design of the sequencing chip, we do not need to have $q^\ell$ probes for all possible $\ell$-grams. Instead, we require only $m = q^\ell - q^{\ell-1} - 1$ probes that correspond to the $\ell$-grams in $I$. Hence, the sequencing channel outputs an ordering on this set of $m$ $\ell$-grams (see Fig. 4(b)).

This setup allows us to integrate known rank modulation codes (in any metric) into our coding schemes for DNA storage. In particular, to encode information we perform the following procedure. First, we encode a message into a permutation using a rank modulation encoder. The permutation is then extended to a profile vector, which Euler maps to a $q$-ary codeword (see Fig. 5 for an illustration).

Example 7.1. Suppose that $S = [\![2]\!]^3$. Hence, we set $m = 3$ and recall the systematic encoder $\phi_{\rm sys}$ described in Example 5.1 that maps $[\![3]\!]^3$ into $\mathbf{p}\mathcal{Q}(14; 2, 3)$. Suppose that $\mathbf{v} = (0, 1, 2) \in {\rm Perm}(3)$ belongs to some rank modulation code. Then $\mathbf{u} = \phi_{\rm sys}(\mathbf{v}) = (3, 1, 0, 2, 1, 1, 2, 2)$ belongs to $\mathbf{p}\mathcal{Q}(14; 2, 3)$. Finally, Euler maps $\mathbf{u}$ to a codeword $00000110111100 \in [\![2]\!]^{14}$.

Now, if we were to detect the relative order of the 3-grams 010, 101 and 111, we obtain the permutation $(0, 1, 2)$ as desired (see also Fig. 4(b)).
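The numbers in Example 7.1 can be checked directly. The sketch below recomputes the profile vector of the codeword (with 3-grams listed in lexicographic order, an assumption about the coordinate ordering) and the relative order of the three probed 3-grams; the function names are hypothetical.

```python
from itertools import product


def profile_vector(word, q, ell):
    """Counts of all q-ary ell-grams in word, listed in lexicographic order."""
    grams = ["".join(g) for g in product("0123456789"[:q], repeat=ell)]
    counts = dict.fromkeys(grams, 0)
    for i in range(len(word) - ell + 1):
        counts[word[i:i + ell]] += 1
    return [counts[g] for g in grams]


def relative_order(word, probes):
    """Rank of each probe's count among the probes (assumes the counts are distinct)."""
    ell = len(probes[0])
    counts = [
        sum(word[i:i + ell] == p for i in range(len(word) - ell + 1))
        for p in probes
    ]
    by_count = sorted(range(len(probes)), key=lambda idx: counts[idx])
    ranks = [0] * len(probes)
    for rank, idx in enumerate(by_count):
        ranks[idx] = rank
    return tuple(ranks)


codeword = "00000110111100"
print(profile_vector(codeword, 2, 3))                 # [3, 1, 0, 2, 1, 1, 2, 2]
print(relative_order(codeword, ["010", "101", "111"]))  # (0, 1, 2)
```

Only the three probed coordinates are needed at the decoder; the remaining entries of the profile vector are determined by the systematic extension.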
Appendix A

Eulerian Property of Certain Restricted De Bruijn Digraphs 

In this section, we provide a detailed proof of Proposition 3.1. Specifically, for $q$, $\ell$, $1 \le q^* \le q - 1$ and $1 \le w_1 < w_2 \le \ell$, we demonstrate that the digraph $D(q, \ell; q^*, [w_1, w_2])$ is Eulerian. Our analysis follows that of Ruskey et al. [15].

Recall that the arc set of $D(q, \ell; q^*, [w_1, w_2])$ is given by $S = S(q, \ell; q^*, [w_1, w_2])$, while the node set is given by $V(S) = S(q, \ell - 1; q^*, [w_1 - 1, w_2])$, which we denote by $V$ for short. In addition, we introduce the following subsets of $[\![q]\!]$. For a node $\mathbf{z}$ in $V$, let ${\rm Pref}(\mathbf{z})$ be the set of symbols in $[\![q]\!]$ that when prepended to $\mathbf{z}$ result in an arc in $S$. Similarly, let ${\rm Suff}(\mathbf{z})$ be the set of symbols in $[\![q]\!]$ that when appended to $\mathbf{z}$ result in an arc in $S$. Hence, $\{\sigma\mathbf{z} : \sigma \in {\rm Pref}(\mathbf{z})\}$ and $\{\mathbf{z}\sigma : \sigma \in {\rm Suff}(\mathbf{z})\}$ are the respective sets of incoming and outgoing arcs for the node $\mathbf{z}$.

Lemma A.1. Every node of $D(q, \ell; q^*, [w_1, w_2])$ has the same number of incoming and outgoing arcs.

Proof: Let $\mathbf{z}$ belong to $V$. Observe that for all $s \in [\![q]\!]$, $s\mathbf{z} \in S$ if and only if $\mathbf{z}s \in S$. Hence, ${\rm Pref}(\mathbf{z}) = {\rm Suff}(\mathbf{z})$ and the lemma follows. ■

It remains to show that $D(q, \ell; q^*, [w_1, w_2])$ is strongly connected. We do so via the following sequence of lemmas.

Lemma A.2. Let $\mathbf{z}, \mathbf{z}'$ belong to $V$ and have the property that they differ in exactly one coordinate. Then there exists a path from $\mathbf{z}$ to $\mathbf{z}'$.


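Proof: Observe the following characterization of ${\rm Pref}(\mathbf{z}) = {\rm Suff}(\mathbf{z})$:

\[ {\rm Pref}(\mathbf{z}) = {\rm Suff}(\mathbf{z}) = \begin{cases} [q - q^*, q - 1], & \text{if } {\rm wt}(\mathbf{z}; q^*) = w_1 - 1; \\ [\![q - q^*]\!], & \text{if } {\rm wt}(\mathbf{z}; q^*) = w_2; \\ [\![q]\!], & \text{otherwise.} \end{cases} \]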

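Then ${\rm Suff}(\mathbf{z}) \cap {\rm Pref}(\mathbf{z}')$ is empty only if ${\rm wt}(\mathbf{z}; q^*) = w_1 - 1$ and ${\rm wt}(\mathbf{z}'; q^*) = w_2$, or vice versa. In that case the two weights differ by $w_2 - w_1 + 1 \ge 2$, so $\mathbf{z}$ and $\mathbf{z}'$ differ in at least two coordinates, which contradicts the starting assumption.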

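Hence, ${\rm Suff}(\mathbf{z}) \cap {\rm Pref}(\mathbf{z}')$ is always nonempty. To complete the proof, let $s \in {\rm Suff}(\mathbf{z}) \cap {\rm Pref}(\mathbf{z}')$. Then, the path corresponding to $\mathbf{z}s\mathbf{z}'$ is the desired path. (Note that each $\ell$-gram appearing in $\mathbf{z}s\mathbf{z}'$ has weight equal to either ${\rm wt}(\mathbf{z}s)$ or ${\rm wt}(s\mathbf{z}')$; in particular, each such $\ell$-gram lies in $S$.) ■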
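The three-case characterization of ${\rm Pref}(\mathbf{z}) = {\rm Suff}(\mathbf{z})$ can be checked exhaustively for small parameters. In this sketch, words are tuples over $\{0, \ldots, q-1\}$ and the weight counts symbols in $[q - q^*, q - 1]$; the function names and the chosen parameters are illustrative assumptions.

```python
from itertools import product


def weight(word, q, q_star):
    """Number of symbols of word lying in the interval [q - q_star, q - 1]."""
    return sum(1 for s in word if s >= q - q_star)


def predicted_symbols(z, q, q_star, w1, w2):
    """Symbol set given by the three-case characterization."""
    w = weight(z, q, q_star)
    if w == w1 - 1:
        return set(range(q - q_star, q))  # must add a heavy symbol
    if w == w2:
        return set(range(q - q_star))     # must add a light symbol
    return set(range(q))                  # any symbol keeps the weight in range


def check_characterization(q, ell, q_star, w1, w2):
    """Compare Pref and Suff, computed from the arc set, against the prediction."""
    arcs = {
        a for a in product(range(q), repeat=ell)
        if w1 <= weight(a, q, q_star) <= w2
    }
    nodes = [
        z for z in product(range(q), repeat=ell - 1)
        if w1 - 1 <= weight(z, q, q_star) <= w2
    ]
    for z in nodes:
        pref = {s for s in range(q) if (s,) + z in arcs}
        suff = {s for s in range(q) if z + (s,) in arcs}
        if not (pref == suff == predicted_symbols(z, q, q_star, w1, w2)):
            return False
    return True


print(check_characterization(q=3, ell=4, q_star=1, w1=1, w2=2))
```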

Therefore, to construct a path between any two given nodes $\mathbf{z}$ and $\mathbf{z}'$, it suffices to demonstrate a sequence of nodes such that consecutive nodes differ in only one position.


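Lemma A.3. For any $\mathbf{z}, \mathbf{z}' \in V$, there is a sequence of nodes $\mathbf{z} = \mathbf{z}_0, \mathbf{z}_1, \ldots, \mathbf{z}_t = \mathbf{z}'$ such that $\mathbf{z}_j$ and $\mathbf{z}_{j+1}$ differ in exactly one position for $j \in [\![t]\!]$.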

Proof: Let $\mathbf{z}' = \sigma_1\sigma_2\cdots\sigma_{\ell-1}$. We construct the sequence of nodes inductively. Suppose that for some $j$, $\mathbf{z}_j = \sigma_1\sigma_2\cdots\sigma_i\tau_{i+1}\cdots\tau_{\ell-1}$, with $\tau_{i+1} \ne \sigma_{i+1}$. Our objective is to construct a sequence of nodes terminating at $\mathbf{z}_{j'}$, such that $\mathbf{z}_{j'} = \sigma_1\sigma_2\cdots\sigma_i\sigma_{i+1}\tau'_{i+2}\cdots\tau'_{\ell-1}$ for some $\tau'_{i+2}, \ldots, \tau'_{\ell-1}$. Hence, by repeating this procedure, we obtain the desired sequence of nodes that terminates at $\mathbf{z}'$.

For the inductive step, if $\sigma_1\sigma_2\cdots\sigma_i\sigma_{i+1}\tau_{i+2}\cdots\tau_{\ell-1} \in V$, we establish the claim by setting

\[ \mathbf{z}_{j+1} = \sigma_1\sigma_2\cdots\sigma_i\sigma_{i+1}\tau_{i+2}\cdots\tau_{\ell-1}. \]

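In the general case, we have to consider the following scenarios:

(i) When ${\rm wt}(\mathbf{z}_j; q^*) = w_1 - 1$, $\tau_{i+1} \in [q - q^*, q - 1]$ and $\sigma_{i+1} \notin [q - q^*, q - 1]$, there exists some $\tau_k$ in $\mathbf{z}_j$ that does not belong to $[q - q^*, q - 1]$. Otherwise, ${\rm wt}(\sigma_1\cdots\sigma_i; q^*) = w_1 - \ell + i$ and so ${\rm wt}(\sigma_1\cdots\sigma_{i+1}; q^*) = w_1 - \ell + i$. Then, ${\rm wt}(\mathbf{z}'; q^*) \le w_1 - 2$, contradicting the fact that $\mathbf{z}' \in V$. Therefore, replacing $\tau_k$ by the symbol $q - 1 \in [q - q^*, q - 1]$, we have the sequence $\mathbf{z}_j = \sigma_1\cdots\sigma_i\tau_{i+1}\tau_{i+2}\cdots\tau_k\cdots\tau_{\ell-1}$, $\mathbf{z}_{j+1} = \sigma_1\cdots\sigma_i\tau_{i+1}\tau_{i+2}\cdots(q-1)\cdots\tau_{\ell-1}$, and $\mathbf{z}_{j+2} = \sigma_1\cdots\sigma_i\sigma_{i+1}\tau_{i+2}\cdots(q-1)\cdots\tau_{\ell-1}$.

(ii) When ${\rm wt}(\mathbf{z}_j; q^*) = w_2$, $\tau_{i+1} \notin [q - q^*, q - 1]$ and $\sigma_{i+1} \in [q - q^*, q - 1]$, there exists some $\tau_k$ in $\mathbf{z}_j$ that belongs to $[q - q^*, q - 1]$. Otherwise, ${\rm wt}(\sigma_1\cdots\sigma_i; q^*) = w_2$ and so ${\rm wt}(\mathbf{z}'; q^*) \ge {\rm wt}(\sigma_1\cdots\sigma_{i+1}; q^*) = w_2 + 1$, contradicting the fact that $\mathbf{z}' \in V$. Therefore, replacing $\tau_k$ by the symbol $0 \notin [q - q^*, q - 1]$, we have the sequence $\mathbf{z}_j = \sigma_1\cdots\sigma_i\tau_{i+1}\tau_{i+2}\cdots\tau_k\cdots\tau_{\ell-1}$, $\mathbf{z}_{j+1} = \sigma_1\cdots\sigma_i\tau_{i+1}\tau_{i+2}\cdots 0\cdots\tau_{\ell-1}$, and $\mathbf{z}_{j+2} = \sigma_1\cdots\sigma_i\sigma_{i+1}\tau_{i+2}\cdots 0\cdots\tau_{\ell-1}$.

■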

Consequently, $D(q, \ell; q^*, [w_1, w_2])$ is strongly connected. Together with Lemma A.1, this result establishes that $D(q, \ell; q^*, [w_1, w_2])$ is Eulerian.
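The Eulerian property can also be confirmed computationally for small instances by checking the two conditions used above: balanced degrees at every node and strong connectivity. The sketch below is a sanity check under the assumption that arcs are the length-$\ell$ words of weight in $[w_1, w_2]$ and that an arc runs from its length-$(\ell-1)$ prefix to its length-$(\ell-1)$ suffix; the function names are hypothetical.

```python
from collections import deque
from itertools import product


def build_restricted_digraph(q, ell, q_star, w1, w2):
    """Return (nodes, arcs) of the restricted de Bruijn digraph."""
    def wt(word):
        return sum(1 for s in word if s >= q - q_star)

    arcs = [a for a in product(range(q), repeat=ell) if w1 <= wt(a) <= w2]
    nodes = {a[:-1] for a in arcs} | {a[1:] for a in arcs}
    return nodes, arcs


def is_eulerian(nodes, arcs):
    """Check in-degree == out-degree everywhere plus strong connectivity."""
    out_adj = {v: [] for v in nodes}
    in_adj = {v: [] for v in nodes}
    for a in arcs:
        out_adj[a[:-1]].append(a[1:])
        in_adj[a[1:]].append(a[:-1])
    balanced = all(len(out_adj[v]) == len(in_adj[v]) for v in nodes)

    def reaches_all(adj):
        start = next(iter(nodes))
        seen, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return len(seen) == len(nodes)

    # Forward and backward reachability from one node implies strong connectivity.
    return balanced and reaches_all(out_adj) and reaches_all(in_adj)


nodes, arcs = build_restricted_digraph(q=2, ell=3, q_star=1, w1=1, w2=2)
print(len(nodes), len(arcs), is_eulerian(nodes, arcs))  # 4 6 True
```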


Appendix B 

Proof of Corollary 4.6

We next provide a detailed proof of Corollary 4.6. Specifically, we demonstrate Proposition B.1, from which the corollary follows directly. For the case that $S = [\![q]\!]^\ell$, Jacquet et al. established a similar result by analyzing a sum of multinomial coefficients. This type of analysis appears to be too complex for a general choice of $S$.

Proposition B.1. Suppose that $D(S)$ is strongly connected and that it contains loops. Let $t = n - \ell + 1$, $D = |S| - |V(S)|$, and let the lattice point enumerator of $\mathcal{P}(S)$ be $L_{\mathcal{P}(S)}(t) = c_D(t)t^D + \sum_{i=0}^{D-1} c_i(t)t^i$. Then, $c_D(t)$ is constant.

To prove this proposition, we use the following straightforward lemma. 

Lemma B.2. Suppose that $D(S)$ is strongly connected and that it contains loops. For all $t$, we have $L_{\mathcal{P}(S)}(t + 1) \ge L_{\mathcal{P}(S)}(t)$.

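Proof: It suffices to show that there is an injection from $\mathcal{F}(n; S)$ to $\mathcal{F}(n + 1; S)$. Suppose that $\mathbf{u} \in \mathcal{F}(n; S)$, so that $\mathbf{A}(S)\mathbf{u} = t\mathbf{b}$. Fix a loop in $D(S)$ and consider the vector $\chi(\mathbf{z})$, where $\mathbf{z}$ is the arc corresponding to the loop. Then, $\mathbf{A}(S)\chi(\mathbf{z}) = \mathbf{b}$ and $\mathbf{A}(S)(\mathbf{u} + \chi(\mathbf{z})) = (t + 1)\mathbf{b}$. So, the map $\mathbf{u} \mapsto \mathbf{u} + \chi(\mathbf{z})$ is an injection from $\mathcal{F}(n; S)$ to $\mathcal{F}(n + 1; S)$. ■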
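The monotonicity can be observed numerically on a small instance. The sketch below assumes that $\mathbf{A}(S)\mathbf{u} = t\mathbf{b}$ encodes flow conservation at every node of $D(S)$ together with the constraint that the entries of $\mathbf{u}$ sum to $t$, and it takes $S = [\![2]\!]^2$, whose digraph contains the loops 00 and 11. The function names are hypothetical.

```python
from itertools import product


def compositions(total, parts):
    """All tuples of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def lattice_count(q, ell, t):
    """Number of nonnegative integer arc weightings of D(q, ell) with total t
    that satisfy flow conservation at every node."""
    arcs = list(product(range(q), repeat=ell))
    count = 0
    for u in compositions(t, len(arcs)):
        balance = {}
        for arc, multiplicity in zip(arcs, u):
            balance[arc[:-1]] = balance.get(arc[:-1], 0) + multiplicity  # outgoing
            balance[arc[1:]] = balance.get(arc[1:], 0) - multiplicity    # incoming
        if all(value == 0 for value in balance.values()):
            count += 1
    return count


print([lattice_count(2, 2, t) for t in range(1, 5)])  # [2, 4, 6, 9]
```

Adding one unit of flow on the loop 00 maps each solution for $t$ to a distinct solution for $t + 1$, which is the injection used in the proof.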

Proof of Proposition B.1: Lemma B.2 demonstrates that $L_{\mathcal{P}(S)}$ is a monotonically non-decreasing function. Intuitively, this implies that the coefficient of its dominating term, $c_D(t)$, cannot be periodic with period greater than 1. We prove this claim formally in what follows.

Suppose that $c_D$ is not constant and that it has period $\tau$. Hence, there exist $t_a \not\equiv t_b \bmod \tau$ such that $c_D(t_a) = a_D$, $c_D(t_b) = b_D$ and $a_D < b_D$. Furthermore, define $a_i = c_i(t_a)$ and $b_i = c_i(t_b)$ for $0 \le i \le D - 1$, and consider the polynomial $\sum_{i=0}^{D} \left( b_i t^i - a_i (t + \tau)^i \right)$. By construction, this polynomial has degree $D$ and a positive leading coefficient. Hence, we can choose $t_1 \equiv t_a \bmod \tau$ and $t_2 \equiv t_b \bmod \tau$ so that $t_1 < t_2 < t_1 + \tau$ and $\sum_{i=0}^{D} \left( b_i t_2^i - a_i (t_1 + \tau)^i \right) > 0$. Consequently,

\[ L_{\mathcal{P}(S)}(t_1 + \tau) = \sum_{i=0}^{D} c_i(t_1 + \tau)(t_1 + \tau)^i = \sum_{i=0}^{D} a_i (t_1 + \tau)^i < \sum_{i=0}^{D} b_i t_2^i = L_{\mathcal{P}(S)}(t_2), \]

contradicting the monotonicity of $L_{\mathcal{P}(S)}$. ■
Appendix C

Properties of the Polytope $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$

We derive properties of the polytope $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$ described in Section 5-A. In particular, under the assumption that $D(S)$ is strongly connected and $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap {\rm Null}_{>0}\mathbf{B}(D(S))$ is nonempty, we demonstrate the following:

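(C1) The dimension of the polytope $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$ is $|S| - |V(S)|$;

(C2) The interior of the polytope is given by $\{\mathbf{u} \in \mathbb{R}^{|S|+d} : \mathbf{A}(\mathbf{H}, S)\mathbf{u} = \mathbf{b}, \mathbf{u} > \mathbf{0}\}$;

(C3) The vertex set of the polytope is given by

\[ \left\{ \left( \frac{\chi(C)}{|C|}, \frac{\mathbf{H}\chi(C)}{p|C|} \right) : C \text{ is a cycle in } D(S) \right\}. \]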

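Since $\mathcal{C}(\mathbf{H}, \mathbf{0}) \cap {\rm Null}_{>0}\mathbf{B}(D(S))$ is nonempty, let $\mathbf{u}_0$ belong to this intersection. Then $\mathbf{H}\mathbf{u}_0 \equiv \mathbf{0} \bmod p$, that is, $\mathbf{H}\mathbf{u}_0 = p\boldsymbol{\beta}$ for some $\boldsymbol{\beta} > \mathbf{0}$. If we set $\mathbf{u} = (\mathbf{u}_0, \boldsymbol{\beta}) / (\mathbf{1}\mathbf{u}_0)$, where $\mathbf{1}\mathbf{u}_0$ is the sum of the entries of $\mathbf{u}_0$, then $\mathbf{A}(\mathbf{H}, S)\mathbf{u} = \mathbf{b}$, with $\mathbf{u} > \mathbf{0}$.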

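Observe that the block structure of $\mathbf{A}(\mathbf{H}, S)$ implies that it has rank $|V(S)| + d$. Hence, the nullity of $\mathbf{A}(\mathbf{H}, S)$ is $|S| - |V(S)|$. As before, let $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{|S|-|V(S)|}$ be linearly independent vectors that span the null space of $\mathbf{A}(\mathbf{H}, S)$. Since $\mathbf{u}$ has strictly positive entries, we can find $\epsilon$ small enough so that $\mathbf{u} + \epsilon\mathbf{u}_i$ belongs to $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$ for all $i \in [|S| - |V(S)|]$. Therefore, $\{\mathbf{u}, \mathbf{u} + \epsilon\mathbf{u}_1, \mathbf{u} + \epsilon\mathbf{u}_2, \ldots, \mathbf{u} + \epsilon\mathbf{u}_{|S|-|V(S)|}\}$ is a set of $|S| - |V(S)| + 1$ affinely independent points in $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$. This proves claim (C1).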

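For the interior of $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$, first consider $\mathbf{u}' > \mathbf{0}$ such that $\mathbf{A}(\mathbf{H}, S)\mathbf{u}' = \mathbf{b}$. For any $\mathbf{u}'' \in \mathcal{P}_{\rm GRC}(\mathbf{H}, S)$, we have $\mathbf{A}(\mathbf{H}, S)\mathbf{u}'' = \mathbf{b}$ and hence, $\mathbf{A}(\mathbf{H}, S)(\mathbf{u}' - \mathbf{u}'') = \mathbf{0}$. Since $\mathbf{u}'$ has strictly positive entries, we choose $\epsilon$ small enough so that $\mathbf{u}' + \epsilon(\mathbf{u}' - \mathbf{u}'') \ge \mathbf{0}$. Therefore, $\mathbf{u}' + \epsilon(\mathbf{u}' - \mathbf{u}'')$ belongs to $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$ and $\mathbf{u}'$ belongs to the interior of $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$.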

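Conversely, let $\mathbf{u}' \in \mathcal{P}_{\rm GRC}(\mathbf{H}, S)$ with $u'_j = 0$ for some coordinate $j$. Let $\mathbf{u}$ be as defined earlier, where $\mathbf{u} \in \mathcal{P}_{\rm GRC}(\mathbf{H}, S)$ with $\mathbf{u} > \mathbf{0}$. Hence, for all $\epsilon > 0$, the $j$th coordinate of $\mathbf{u}' + \epsilon(\mathbf{u}' - \mathbf{u})$ is given by $-\epsilon u_j$, which is always negative. In other words, $\mathbf{u}'$ does not belong to the interior of $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$. This characterizes the interior as described in claim (C2).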

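For the vertex set, observe that $\left\{ \left( \frac{\chi(C)}{|C|}, \frac{\mathbf{H}\chi(C)}{p|C|} \right) : C \text{ is a cycle in } D(S) \right\} \subseteq \mathcal{P}_{\rm GRC}(\mathbf{H}, S)$.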

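Let $\mathbf{v} \in \mathcal{P}_{\rm GRC}(\mathbf{H}, S)$ and suppose that $\mathbf{v} = (\mathbf{v}_1, \mathbf{v}_2)$ is a vertex. Since $\mathbf{v} \in \mathcal{P}_{\rm GRC}(\mathbf{H}, S)$, we have $\mathbf{v}_2 = \frac{1}{p}\mathbf{H}\mathbf{v}_1$ and $\mathbf{B}(D(S))\mathbf{v}_1 = \mathbf{0}$. Proceeding as in the proof of Lemma 4.3, we conclude that $\mathbf{v}_1 = \chi(C)/|C|$ for some cycle $C$ in $D(S)$ and hence, $\mathbf{v} = \left( \frac{\chi(C)}{|C|}, \frac{\mathbf{H}\chi(C)}{p|C|} \right)$.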

Conversely, we show that for any cycle $C$ in $D(S)$, the point $\left( \frac{\chi(C)}{|C|}, \frac{\mathbf{H}\chi(C)}{p|C|} \right)$ cannot be expressed as a convex combination of other points in $\mathcal{P}_{\rm GRC}(\mathbf{H}, S)$. Suppose otherwise. Then we consider the first $|S|$ coordinates and proceed as in the proof of Lemma 4.5 to yield a contradiction. This completes the proof of claim (C3).